- Decision Trees
- Collecting/Importing Data
- Exploring and Preparing the Data
- Partition data into training and test datasets
- Training a Decision Tree Model
- Evaluating Model Performance
- Improving Decision Tree Accuracy
- Regression Trees
There are numerous implementations of decision trees, but one of the most well-known is the C5.0 algorithm.
The C5.0 algorithm has become the industry standard for producing decision trees because it does well on most types of problems directly out of the box.
Since government organizations in many countries carefully monitor lending practices, executives must be able to explain why one applicant was rejected for a loan while others were approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory.
In this section, we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution.
The idea behind our credit model is to identify factors that are predictive of higher risk of default.
We will first download and import the credit dataset, which contains information on loans obtained from a credit agency in Germany.
The credit dataset includes 1,000 examples of loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default (that is, the borrower failed to repay the principal and interest). Let’s see whether we can determine any patterns that predict this outcome.
# load data
library(tidyverse)
credit <- read_csv("credit.csv")
We can first take a quick look at the dataset.
str(credit)
## spc_tbl_ [1,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ checking_balance    : chr [1:1000] "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
##  $ months_loan_duration: num [1:1000] 6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : chr [1:1000] "critical" "good" "critical" "good" ...
##  $ purpose             : chr [1:1000] "furniture/appliances" "furniture/appliances" "education" "furniture/appliances" ...
##  $ amount              : num [1:1000] 1169 5951 2096 7882 4870 ...
##  $ savings_balance     : chr [1:1000] "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
##  $ employment_duration : chr [1:1000] "> 7 years" "1 - 4 years" "4 - 7 years" "4 - 7 years" ...
##  $ percent_of_income   : num [1:1000] 4 2 2 2 3 2 3 2 2 4 ...
##  $ years_at_residence  : num [1:1000] 4 2 3 4 4 4 4 2 4 2 ...
##  $ age                 : num [1:1000] 67 22 49 45 53 35 53 35 61 28 ...
##  $ other_credit        : chr [1:1000] "none" "none" "none" "none" ...
##  $ housing             : chr [1:1000] "own" "own" "own" "other" ...
##  $ existing_loans_count: num [1:1000] 2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : chr [1:1000] "skilled" "skilled" "unskilled" "skilled" ...
##  $ dependents          : num [1:1000] 1 1 2 2 2 2 1 1 1 1 ...
##  $ phone               : chr [1:1000] "yes" "no" "no" "no" ...
##  $ default             : chr [1:1000] "no" "yes" "no" "no" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   checking_balance = col_character(),
##   ..   months_loan_duration = col_double(),
##   ..   credit_history = col_character(),
##   ..   purpose = col_character(),
##   ..   amount = col_double(),
##   ..   savings_balance = col_character(),
##   ..   employment_duration = col_character(),
##   ..   percent_of_income = col_double(),
##   ..   years_at_residence = col_double(),
##   ..   age = col_double(),
##   ..   other_credit = col_character(),
##   ..   housing = col_character(),
##   ..   existing_loans_count = col_double(),
##   ..   job = col_character(),
##   ..   dependents = col_double(),
##   ..   phone = col_character(),
##   ..   default = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
Let’s take a look at the table() output for a couple of loan features that seem likely to predict a default.
# look at two characteristics of the applicant.
table(credit$checking_balance)
##
##     < 0 DM   > 200 DM 1 - 200 DM    unknown
##        274         63        269        394
table(credit$savings_balance)
##
##      < 100 DM     > 1000 DM  100 - 500 DM 500 - 1000 DM       unknown
##           603            48           103            63           183
# Note: the data come from Germany, so the currency unit is the Deutsche Mark (DM).
Some of the loan’s features are numeric, such as its duration and the amount of credit requested:
summary(credit$months_loan_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     4.0    12.0    18.0    20.9    24.0    72.0
summary(credit$amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     250    1366    2320    3271    3972   18424
The default variable records the outcome: whether the loan applicant was unable to meet the agreed payment terms and went into default.
table(credit$default) #30% went into default
##
##  no yes
## 700 300
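Because the two classes are not equally common, it can help to keep a naive baseline in mind when judging a model later on. The following is a minimal sketch that uses only the two counts printed above; the variable names are illustrative and not part of the original tutorial.

```r
# Class counts copied from the table(credit$default) output
default_counts <- c(no = 700, yes = 300)

# Share of each class among the 1,000 loans
class_share <- default_counts / sum(default_counts)
class_share

# Accuracy of a naive rule that always predicts the majority class ("no")
baseline_accuracy <- max(class_share)
baseline_accuracy   # 0.7
```

A classifier is only adding value if it does better than this majority-class rule.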
We will split our data into two portions:
a training dataset to build the decision tree (90%)
a test dataset to evaluate the model performance (10%)
RNGversion("3.5.2") # use an older random number generator to match the book
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
set.seed(123) # use set.seed to use the same random number sequence as the tutorial
train_sample <- sample(1000, 900)
The resulting train_sample is a vector of 900 random integers, which we can use as row indices to split the data into training and test datasets.
str(train_sample)
## int [1:900] 288 788 409 881 937 46 525 887 548 453 ...
# split the data frames
credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]
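The split relies on negative indexing: a negative index vector selects every row that is not listed. The sketch below illustrates the idea on a small made-up data frame (toy, toy_sample, and the seed are hypothetical and chosen only for illustration) and checks that the two pieces do not overlap.

```r
# Toy stand-in for the credit data: 10 rows with an id column (hypothetical)
toy <- data.frame(id = 1:10, amount = seq(100, 1000, by = 100))

set.seed(1)                       # any seed works; used only for reproducibility
toy_sample <- sample(10, 7)       # 7 distinct row indices for training

toy_train <- toy[toy_sample, ]    # rows listed in toy_sample
toy_test  <- toy[-toy_sample, ]   # every row NOT listed in toy_sample

# The two pieces should share no rows and together cover all rows
length(intersect(toy_train$id, toy_test$id))   # 0
nrow(toy_train) + nrow(toy_test)               # 10
```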
Then we can check whether the training and test datasets have similar proportions of defaulted loans.
# check the proportion of class variable
prop.table(table(credit_train$default))
##
##        no       yes
## 0.7033333 0.2966667
prop.table(table(credit_test$default))
##
##   no  yes
## 0.67 0.33
Both sets contain roughly 30 percent defaulted loans, so this appears to be a fairly even split, and we can now build our decision tree.
We will use the C5.0 algorithm in the C50 package to train our decision tree model. If the package is not yet installed, install it first and then load it.
#install.packages("C50")
library(C50)
## Warning: package 'C50' was built under R version 4.4.3
For the first iteration of our credit approval model, we’ll use the default C5.0 configuration, as shown in the following code.
The 17th column in credit_train is the default class variable, so we need to exclude it from the training data frame.
# C5.0() requires the outcome variable to be a factor
credit_train$default <- as.factor(credit_train$default)
credit_test$default <- as.factor(credit_test$default)
# train a decision tree model
credit_model <- C5.0(credit_train[-17], credit_train$default)
We can print the model object to see some basic facts about the tree, and then use summary() to see its detailed branches.
# display simple facts about the tree
credit_model
##
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default)
##
## Classification Tree
## Number of samples: 900
## Number of predictors: 16
##
## Tree size: 57
##
## Non-standard options: attempt to group attributes
# display detailed information about the tree
summary(credit_model)
##
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Apr 1 18:24:23 2025
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 900 cases (17 attributes) from undefined.data
##
## Decision tree:
##
## checking_balance in {unknown,> 200 DM}: no (412/50)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...credit_history in {perfect,very good}: yes (59/18)
## credit_history in {poor,critical,good}:
## :...months_loan_duration <= 22:
## :...credit_history = critical: no (72/14)
## : credit_history = poor:
## : :...dependents > 1: no (5)
## : : dependents <= 1:
## : : :...years_at_residence <= 3: yes (4/1)
## : : years_at_residence > 3: no (5/1)
## : credit_history = good:
## : :...savings_balance in {500 - 1000 DM,> 1000 DM}: no (15/1)
## : savings_balance = 100 - 500 DM:
## : :...other_credit in {none,store}: no (9/2)
## : : other_credit = bank: yes (3)
## : savings_balance = unknown:
## : :...other_credit in {none,store}: no (21/8)
## : : other_credit = bank: yes (1)
## : savings_balance = < 100 DM:
## : :...purpose in {car0,business,renovations}: no (8/2)
## : purpose = education:
## : :...checking_balance = 1 - 200 DM: no (1)
## : : checking_balance = < 0 DM: yes (4)
## : purpose = car:
## : :...employment_duration = unemployed: no (4/1)
## : : employment_duration = > 7 years: yes (5)
## : : employment_duration = 4 - 7 years:
## : : :...amount <= 1680: yes (2)
## : : : amount > 1680: no (3)
## : : employment_duration = 1 - 4 years:
## : : :...years_at_residence <= 2: yes (2)
## : : : years_at_residence > 2: no (6/1)
## : : employment_duration = < 1 year:
## : : :...years_at_residence <= 2: yes (5)
## : : years_at_residence > 2: no (3/1)
## : purpose = furniture/appliances:
## : :...job in {management,unskilled}: no (23/3)
## : job = unemployed: yes (1)
## : job = skilled:
## : :...months_loan_duration > 13: [S1]
## : months_loan_duration <= 13:
## : :...housing in {other,own}: no (23/4)
## : housing = rent:
## : :...percent_of_income <= 3: yes (3)
## : percent_of_income > 3: no (2)
## months_loan_duration > 22:
## :...savings_balance = 500 - 1000 DM: yes (4/1)
## savings_balance = > 1000 DM: no (2)
## savings_balance = 100 - 500 DM:
## :...credit_history in {poor,critical}: no (14/3)
## : credit_history = good:
## : :...other_credit in {none,store}: yes (12/2)
## : other_credit = bank: no (1)
## savings_balance = unknown:
## :...checking_balance = 1 - 200 DM: no (17)
## : checking_balance = < 0 DM:
## : :...credit_history = critical: no (1)
## : credit_history in {poor,good}: yes (12/3)
## savings_balance = < 100 DM:
## :...months_loan_duration > 47: yes (21/2)
## months_loan_duration <= 47:
## :...housing = other:
## :...percent_of_income <= 2: no (6)
## : percent_of_income > 2: yes (9/3)
## housing = rent:
## :...other_credit in {none,store}: yes (16/3)
## : other_credit = bank: no (1)
## housing = own:
## :...employment_duration = > 7 years: no (13/4)
## employment_duration = unemployed:
## :...years_at_residence <= 2: yes (4)
## : years_at_residence > 2: no (3)
## employment_duration = 4 - 7 years:
## :...job in {management,skilled,
## : : unemployed}: yes (9/1)
## : job = unskilled: no (1)
## employment_duration = 1 - 4 years:
## :...purpose in {car0,business,education}: yes (7/1)
## : purpose in {furniture/appliances,
## : : renovations}: no (7)
## : purpose = car:
## : :...years_at_residence <= 3: yes (3)
## : years_at_residence > 3: no (3)
## employment_duration = < 1 year:
## :...years_at_residence > 3: yes (5)
## years_at_residence <= 3:
## :...other_credit = bank: no (0)
## other_credit = store: yes (1)
## other_credit = none:
## :...checking_balance = 1 - 200 DM: no (8/2)
## checking_balance = < 0 DM:
## :...job in {management,skilled,
## : unemployed}: yes (2)
## job = unskilled: no (3/1)
##
## SubTree [S1]
##
## employment_duration in {unemployed,> 7 years,1 - 4 years}: yes (10)
## employment_duration in {4 - 7 years,< 1 year}: no (4)
##
##
## Evaluation on training data (900 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 56 133(14.8%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 598 35 (a): class no
## 98 169 (b): class yes
##
##
## Attribute usage:
##
## 100.00% checking_balance
## 54.22% credit_history
## 47.67% months_loan_duration
## 38.11% savings_balance
## 14.33% purpose
## 14.33% housing
## 12.56% employment_duration
## 9.00% job
## 8.67% other_credit
## 6.33% years_at_residence
## 2.22% percent_of_income
## 1.56% dependents
## 0.56% amount
##
##
## Time: 0.0 secs
The preceding output shows some of the first branches in the decision tree. The first three lines could be represented in plain language as:
If the checking account balance is unknown or greater than 200 DM, then classify as “not likely to default.”
Otherwise, if the checking account balance is less than zero DM or between one and 200 DM,
and the credit history is perfect or very good, then classify as “likely to default.”
The numbers in parentheses indicate the number of examples meeting the criteria for that decision, and the number incorrectly classified by the decision.
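For instance, on the first line, 412/50 indicates that of the 412 examples reaching the decision, 50 were incorrectly classified as not likely to default. In other words, 50 of these applicants actually defaulted despite the model’s prediction to the contrary.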
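As a quick arithmetic check of how to read these counts, the sketch below computes the error rate of the first leaf from the 412/50 pair in the output. The variable names are illustrative only.

```r
# Counts taken from the first line of the tree output: "no (412/50)"
n_reaching <- 412        # training examples that reached this leaf
n_misclassified <- 50    # examples at the leaf whose true class was "yes"

leaf_error_rate <- n_misclassified / n_reaching
leaf_accuracy <- 1 - leaf_error_rate

round(leaf_error_rate, 3)   # 0.121
round(leaf_accuracy, 3)     # 0.879
```

The same calculation applies to any leaf; a leaf shown with a single number, such as (5), had no misclassified training examples.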
Sometimes a tree results in decisions that make little logical sense. They might reflect a real pattern in the data, or they may be a statistical anomaly.
Decision trees are known for having a tendency to overfit the model to the training data. For this reason, the error rate reported on training data may be overly optimistic, and it is especially important to evaluate decision trees on a test dataset.
To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:
# create a factor vector of predictions on test data
credit_pred <- predict(credit_model, credit_test)
Then we can see how well the model did on the test data. A model’s performance is often worse on unseen data.
# cross tabulation of predicted versus actual classes
#install.packages("gmodels")
library(gmodels)
## Warning: package 'gmodels' was built under R version 4.4.3
CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))
##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table:  100
##
##
##                | predicted default
## actual default |        no |       yes | Row Total |
## ---------------|-----------|-----------|-----------|
##             no |        59 |         8 |        67 |
##                |     0.590 |     0.080 |           |
## ---------------|-----------|-----------|-----------|
##            yes |        19 |        14 |        33 |
##                |     0.190 |     0.140 |           |
## ---------------|-----------|-----------|-----------|
##   Column Total |        78 |        22 |       100 |
## ---------------|-----------|-----------|-----------|
##
##
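Out of the 100 loan applications in the test set, the model correctly predicted that 59 did not default and 14 did default, for an accuracy of 73 percent and an error rate of 27 percent. More importantly, it identified only 14 of the 33 actual defaults (about 42 percent). An error rate this high, concentrated in such costly mistakes, is likely too much for a real-time credit scoring application.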
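To quantify this, the counts in the cross table can be turned into summary metrics. The following is a minimal sketch that re-enters the four cell counts by hand; conf_mat and the metric names are illustrative and not part of the original code.

```r
# Counts copied from the CrossTable output (rows = actual, columns = predicted)
conf_mat <- matrix(c(59,  8,
                     19, 14),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("no", "yes"),
                                   predicted = c("no", "yes")))

# Share of all test cases classified correctly: (59 + 14) / 100
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
error_rate <- 1 - accuracy

# Share of actual defaults that the model caught: 14 / 33
default_recall <- conf_mat["yes", "yes"] / sum(conf_mat["yes", ])

accuracy                   # 0.73
error_rate                 # 0.27
round(default_recall, 3)   # 0.424
```

The 19 missed defaults are the errors most likely to cost the lender money, which is why the share of defaults caught matters as much as overall accuracy here.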
Boosting: One way the C5.0 algorithm can improve accuracy is through adaptive boosting. This is a process in which many decision trees are built and the trees vote on the best class for each example.
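The voting idea can be illustrated with a deliberately simplified sketch. It assumes three hypothetical trees whose predictions are typed in by hand and combines them with an unweighted majority vote. Boosting implementations typically weight each tree's vote, so this is not a reproduction of C5.0's internal scheme, only an illustration of the principle.

```r
# Hypothetical predictions from three trees for five loan applicants
tree_votes <- data.frame(
  tree1 = c("no",  "yes", "no",  "yes", "no"),
  tree2 = c("no",  "no",  "yes", "yes", "no"),
  tree3 = c("yes", "yes", "no",  "yes", "no"),
  stringsAsFactors = FALSE
)

# For one applicant, return the class that receives the most votes
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# Apply the vote row by row (one row per applicant)
ensemble_pred <- apply(tree_votes, 1, majority_vote)
ensemble_pred
```

With an odd number of trees and two classes, ties cannot occur, which keeps the example simple.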
The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add an additional trials parameter indicating the number of separate decision trees to use in the boosted team.
# boosted decision tree with 10 trials
credit_boost10 <- C5.0(credit_train[-17],
                       credit_train$default,
                       trials = 10)
credit_boost10
##
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)
##
## Classification Tree
## Number of samples: 900
## Number of predictors: 16
##
## Number of boosting iterations: 10
## Average tree size: 47.5
##
## Non-standard options: attempt to group attributes
summary(credit_boost10)
##
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Apr 1 18:24:23 2025
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 900 cases (17 attributes) from undefined.data
##
## ----- Trial 0: -----
##
## Decision tree:
##
## checking_balance in {unknown,> 200 DM}: no (412/50)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...credit_history in {perfect,very good}: yes (59/18)
## credit_history in {poor,critical,good}:
## :...months_loan_duration <= 22:
## :...credit_history = critical: no (72/14)
## : credit_history = poor:
## : :...dependents > 1: no (5)
## : : dependents <= 1:
## : : :...years_at_residence <= 3: yes (4/1)
## : : years_at_residence > 3: no (5/1)
## : credit_history = good:
## : :...savings_balance in {500 - 1000 DM,> 1000 DM}: no (15/1)
## : savings_balance = 100 - 500 DM:
## : :...other_credit in {none,store}: no (9/2)
## : : other_credit = bank: yes (3)
## : savings_balance = unknown:
## : :...other_credit in {none,store}: no (21/8)
## : : other_credit = bank: yes (1)
## : savings_balance = < 100 DM:
## : :...purpose in {car0,business,renovations}: no (8/2)
## : purpose = education:
## : :...checking_balance = 1 - 200 DM: no (1)
## : : checking_balance = < 0 DM: yes (4)
## : purpose = car:
## : :...employment_duration = unemployed: no (4/1)
## : : employment_duration = > 7 years: yes (5)
## : : employment_duration = 4 - 7 years:
## : : :...amount <= 1680: yes (2)
## : : : amount > 1680: no (3)
## : : employment_duration = 1 - 4 years:
## : : :...years_at_residence <= 2: yes (2)
## : : : years_at_residence > 2: no (6/1)
## : : employment_duration = < 1 year:
## : : :...years_at_residence <= 2: yes (5)
## : : years_at_residence > 2: no (3/1)
## : purpose = furniture/appliances:
## : :...job in {management,unskilled}: no (23/3)
## : job = unemployed: yes (1)
## : job = skilled:
## : :...months_loan_duration > 13: [S1]
## : months_loan_duration <= 13:
## : :...housing in {other,own}: no (23/4)
## : housing = rent:
## : :...percent_of_income <= 3: yes (3)
## : percent_of_income > 3: no (2)
## months_loan_duration > 22:
## :...savings_balance = 500 - 1000 DM: yes (4/1)
## savings_balance = > 1000 DM: no (2)
## savings_balance = 100 - 500 DM:
## :...credit_history in {poor,critical}: no (14/3)
## : credit_history = good:
## : :...other_credit in {none,store}: yes (12/2)
## : other_credit = bank: no (1)
## savings_balance = unknown:
## :...checking_balance = 1 - 200 DM: no (17)
## : checking_balance = < 0 DM:
## : :...credit_history = critical: no (1)
## : credit_history in {poor,good}: yes (12/3)
## savings_balance = < 100 DM:
## :...months_loan_duration > 47: yes (21/2)
## months_loan_duration <= 47:
## :...housing = other:
## :...percent_of_income <= 2: no (6)
## : percent_of_income > 2: yes (9/3)
## housing = rent:
## :...other_credit in {none,store}: yes (16/3)
## : other_credit = bank: no (1)
## housing = own:
## :...employment_duration = > 7 years: no (13/4)
## employment_duration = unemployed:
## :...years_at_residence <= 2: yes (4)
## : years_at_residence > 2: no (3)
## employment_duration = 4 - 7 years:
## :...job in {management,skilled,
## : : unemployed}: yes (9/1)
## : job = unskilled: no (1)
## employment_duration = 1 - 4 years:
## :...purpose in {car0,business,education}: yes (7/1)
## : purpose in {furniture/appliances,
## : : renovations}: no (7)
## : purpose = car:
## : :...years_at_residence <= 3: yes (3)
## : years_at_residence > 3: no (3)
## employment_duration = < 1 year:
## :...years_at_residence > 3: yes (5)
## years_at_residence <= 3:
## :...other_credit = bank: no (0)
## other_credit = store: yes (1)
## other_credit = none:
## :...checking_balance = 1 - 200 DM: no (8/2)
## checking_balance = < 0 DM:
## :...job in {management,skilled,
## : unemployed}: yes (2)
## job = unskilled: no (3/1)
##
## SubTree [S1]
##
## employment_duration in {unemployed,> 7 years,1 - 4 years}: yes (10)
## employment_duration in {4 - 7 years,< 1 year}: no (4)
##
## ----- Trial 1: -----
##
## Decision tree:
##
## checking_balance = unknown:
## :...other_credit in {bank,store}:
## : :...purpose in {car0,furniture/appliances}: no (24.8/6.6)
## : : purpose in {business,education,renovations}: yes (19.5/6.3)
## : : purpose = car:
## : : :...dependents <= 1: yes (20.1/4.8)
## : : dependents > 1: no (2.4)
## : other_credit = none:
## : :...credit_history in {critical,perfect,very good}: no (102.8/4.4)
## : credit_history = good:
## : :...existing_loans_count <= 1: no (112.7/17.5)
## : : existing_loans_count > 1: yes (18.9/7.9)
## : credit_history = poor:
## : :...years_at_residence <= 1: yes (4.4)
## : years_at_residence > 1:
## : :...percent_of_income <= 3: no (11.9)
## : percent_of_income > 3: yes (14.3/5.6)
## checking_balance in {1 - 200 DM,> 200 DM,< 0 DM}:
## :...savings_balance in {500 - 1000 DM,> 1000 DM}: no (42.9/11.3)
## savings_balance = unknown:
## :...credit_history in {poor,perfect}: no (8.5)
## : credit_history in {critical,good,very good}:
## : :...employment_duration in {unemployed,> 7 years,4 - 7 years,
## : : < 1 year}: no (52.3/17.3)
## : employment_duration = 1 - 4 years: yes (19.7/5.6)
## savings_balance = 100 - 500 DM:
## :...existing_loans_count > 3: yes (3)
## : existing_loans_count <= 3:
## : :...credit_history in {poor,critical,very good}: no (24.6/7.6)
## : credit_history = perfect: yes (2.4)
## : credit_history = good:
## : :...months_loan_duration <= 27: no (23.7/10.5)
## : months_loan_duration > 27: yes (5.6)
## savings_balance = < 100 DM:
## :...months_loan_duration > 42: yes (28/5.2)
## months_loan_duration <= 42:
## :...percent_of_income <= 2:
## :...employment_duration in {unemployed,4 - 7 years,
## : : 1 - 4 years}: no (86.2/23.8)
## : employment_duration in {> 7 years,< 1 year}:
## : :...housing = other: no (4.8/1.6)
## : housing = rent: yes (10.7/2.4)
## : housing = own:
## : :...phone = yes: yes (12.9/4)
## : phone = no:
## : :...percent_of_income <= 1: no (7.1/0.8)
## : percent_of_income > 1: yes (17.5/7.1)
## percent_of_income > 2:
## :...years_at_residence <= 1: no (31.6/8.5)
## years_at_residence > 1:
## :...credit_history in {poor,perfect}: yes (20.9/1.6)
## credit_history in {critical,good,very good}:
## :...job = skilled: yes (95/34.7)
## job = unemployed: no (1.6)
## job = management:
## :...amount <= 11590: no (23.8/7)
## : amount > 11590: yes (3.8)
## job = unskilled:
## :...checking_balance = 1 - 200 DM: no (17.9/6.2)
## checking_balance in {> 200 DM,
## < 0 DM}: yes (23.8/9.5)
##
## ----- Trial 2: -----
##
## Decision tree:
##
## checking_balance = unknown:
## :...other_credit = bank:
## : :...existing_loans_count > 2: no (3.3)
## : : existing_loans_count <= 2:
## : : :...months_loan_duration <= 8: no (4)
## : : months_loan_duration > 8: yes (43/16.6)
## : other_credit in {none,store}:
## : :...employment_duration in {unemployed,< 1 year}:
## : :...purpose in {business,renovations}: yes (6.4)
## : : purpose in {car0,car,education}: no (13.2)
## : : purpose = furniture/appliances:
## : : :...amount <= 4594: no (22.5/7.3)
## : : amount > 4594: yes (9.1)
## : employment_duration in {> 7 years,4 - 7 years,1 - 4 years}:
## : :...percent_of_income <= 3: no (92.7/3.6)
## : percent_of_income > 3:
## : :...age > 30: no (73.6/5.5)
## : age <= 30:
## : :...job in {management,unskilled,unemployed}: yes (14/4)
## : job = skilled:
## : :...credit_history = very good: no (0)
## : credit_history = poor: yes (3.6)
## : credit_history in {critical,good,perfect}:
## : :...age <= 29: no (20.4/4.6)
## : age > 29: yes (2.7)
## checking_balance in {1 - 200 DM,> 200 DM,< 0 DM}:
## :...housing = other:
## :...dependents > 1: yes (28.3/7.6)
## : dependents <= 1:
## : :...employment_duration in {unemployed,4 - 7 years,
## : : < 1 year}: no (22.9/4.5)
## : employment_duration in {> 7 years,1 - 4 years}: yes (29.6/10.5)
## housing = rent:
## :...credit_history = poor: no (7.1/0.7)
## : credit_history = perfect: yes (5.3)
## : credit_history in {critical,good,very good}:
## : :...employment_duration in {unemployed,> 7 years,
## : : 4 - 7 years}: no (33.9/12.3)
## : employment_duration = < 1 year: yes (28.3/9.3)
## : employment_duration = 1 - 4 years:
## : :...checking_balance = > 200 DM: no (2)
## : checking_balance in {1 - 200 DM,< 0 DM}:
## : :...years_at_residence <= 3: no (10.3/3.8)
## : years_at_residence > 3: yes (20.4/3.1)
## housing = own:
## :...job in {management,unemployed}: yes (55.8/19.8)
## job in {skilled,unskilled}:
## :...months_loan_duration <= 7: no (25.3/2)
## months_loan_duration > 7:
## :...years_at_residence > 3: no (92.2/29.6)
## years_at_residence <= 3:
## :...purpose = renovations: yes (7/1.3)
## purpose in {car0,business,education}: no (32.2/5.3)
## purpose = car:
## :...months_loan_duration > 40: no (7.2/0.7)
## : months_loan_duration <= 40:
## : :...amount <= 947: yes (12.9)
## : amount > 947:
## : :...months_loan_duration <= 16: no (23.2/8.5)
## : months_loan_duration > 16: [S1]
## purpose = furniture/appliances:
## :...savings_balance in {100 - 500 DM,
## : 500 - 1000 DM}: yes (14.6/4.5)
## savings_balance in {unknown,> 1000 DM}: no (15.4/3.2)
## savings_balance = < 100 DM:
## :...months_loan_duration > 36: yes (7.1)
## months_loan_duration <= 36:
## :...existing_loans_count > 1: no (14.1/4.3)
## existing_loans_count <= 1: [S2]
##
## SubTree [S1]
##
## savings_balance = 100 - 500 DM: no (4.5/0.7)
## savings_balance in {unknown,500 - 1000 DM,< 100 DM,> 1000 DM}: yes (22.5/2.7)
##
## SubTree [S2]
##
## checking_balance in {1 - 200 DM,> 200 DM}: yes (46.7/20)
## checking_balance = < 0 DM: no (22.4/9.1)
##
## ----- Trial 3: -----
##
## Decision tree:
##
## checking_balance in {unknown,> 200 DM}:
## :...employment_duration = unemployed: yes (16/6.7)
## : employment_duration = > 7 years: no (98.9/17.1)
## : employment_duration = 4 - 7 years:
## : :...checking_balance = > 200 DM: yes (9.6/3.6)
## : : checking_balance = unknown:
## : : :...age <= 22: yes (6.5/1.6)
## : : age > 22: no (42.6/1.5)
## : employment_duration = < 1 year:
## : :...amount <= 1333: no (11.7)
## : : amount > 1333:
## : : :...amount <= 6681: no (38.2/16.3)
## : : amount > 6681: yes (5.3)
## : employment_duration = 1 - 4 years:
## : :...percent_of_income <= 1: no (20.6/1.5)
## : percent_of_income > 1:
## : :...job in {skilled,unemployed}: no (64.9/17.6)
## : job in {management,unskilled}:
## : :...existing_loans_count > 2: yes (2.4)
## : existing_loans_count <= 2:
## : :...age <= 34: yes (26.4/10.7)
## : age > 34: no (10.5)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...savings_balance in {500 - 1000 DM,> 1000 DM}: no (35.8/12)
## savings_balance = 100 - 500 DM:
## :...amount <= 1285: yes (12.8/0.5)
## : amount > 1285:
## : :...existing_loans_count <= 1: no (27/9.2)
## : existing_loans_count > 1: yes (15.8/4.9)
## savings_balance = unknown:
## :...credit_history in {poor,critical,perfect}: no (15.5)
## : credit_history in {good,very good}:
## : :...age > 56: no (4.5)
## : age <= 56:
## : :...months_loan_duration <= 18: yes (24.5/5.6)
## : months_loan_duration > 18: no (28.4/12.3)
## savings_balance = < 100 DM:
## :...months_loan_duration <= 11:
## :...job = management: yes (13.7/4.9)
## : job in {skilled,unskilled,unemployed}: no (45.9/10)
## months_loan_duration > 11:
## :...percent_of_income <= 1:
## :...credit_history in {poor,critical,very good}: no (11.1)
## : credit_history in {good,perfect}: yes (24.4/11)
## percent_of_income > 1:
## :...job = unemployed: yes (7/3.1)
## job = management:
## :...years_at_residence <= 1: no (6.6)
## : years_at_residence > 1:
## : :...checking_balance = 1 - 200 DM: yes (15.8/4)
## : checking_balance = < 0 DM: no (23.1/7)
## job = unskilled:
## :...housing in {other,rent}: yes (12.2/2.2)
## : housing = own:
## : :...purpose = car: yes (18.1/3.9)
## : purpose in {car0,furniture/appliances,business,
## : education,renovations}: no (32.1/11.1)
## job = skilled:
## :...checking_balance = 1 - 200 DM:
## :...months_loan_duration > 36: yes (6.5)
## : months_loan_duration <= 36:
## : :...other_credit in {bank,store}: yes (8/1.5)
## : other_credit = none:
## : :...dependents > 1: yes (7.4/3.1)
## : dependents <= 1:
## : :...percent_of_income <= 2: no (12.7/1.1)
## : percent_of_income > 2: [S1]
## checking_balance = < 0 DM:
## :...credit_history in {poor,very good}: yes (16.6)
## credit_history in {critical,good,perfect}:
## :...purpose in {car0,business,education,
## : renovations}: yes (10.2/1.5)
## purpose = car:
## :...age <= 51: yes (34.6/8.1)
## : age > 51: no (4.4)
## purpose = furniture/appliances:
## :...years_at_residence <= 1: no (4.4)
## years_at_residence > 1:
## :...other_credit = bank: yes (2.4)
## other_credit = store: no (0.5)
## other_credit = none:
## :...amount <= 1743: no (11.5/2.4)
## amount > 1743: yes (29/6.6)
##
## SubTree [S1]
##
## purpose in {car0,car,furniture/appliances,education}: no (19.8/6.1)
## purpose in {business,renovations}: yes (3.9)
##
## ----- Trial 4: -----
##
## Decision tree:
##
## checking_balance in {unknown,> 200 DM}:
## :...other_credit = store: no (20.6/9.6)
## : other_credit = none:
## : :...employment_duration in {unemployed,> 7 years,4 - 7 years,
## : : : 1 - 4 years}: no (211.3/45.7)
## : : employment_duration = < 1 year:
## : : :...amount <= 1333: no (8.8)
## : : amount > 1333:
## : : :...purpose = car: no (4.9)
## : : purpose in {car0,furniture/appliances,business,education,
## : : renovations}: yes (32.9/8.1)
## : other_credit = bank:
## : :...age > 44: no (14.4/1.2)
## : age <= 44:
## : :...years_at_residence <= 1: no (5)
## : years_at_residence > 1:
## : :...housing = rent: yes (4.3)
## : housing in {other,own}:
## : :...job = unemployed: yes (0)
## : job = management: no (4)
## : job in {skilled,unskilled}:
## : :...age <= 26: no (3.7)
## : age > 26:
## : :...savings_balance in {100 - 500 DM,
## : : > 1000 DM}: no (4)
## : savings_balance in {unknown,500 - 1000 DM,
## : < 100 DM}: yes (30.6/7.4)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...credit_history = perfect:
## :...housing in {other,rent}: yes (7.8)
## : housing = own: no (20.5/9)
## credit_history = poor:
## :...checking_balance = < 0 DM: yes (10.4/2.2)
## : checking_balance = 1 - 200 DM:
## : :...other_credit in {none,bank}: no (24/4.3)
## : other_credit = store: yes (5.8/1.2)
## credit_history = very good:
## :...age <= 23: no (5.7)
## : age > 23:
## : :...months_loan_duration <= 27: yes (28.4/3.7)
## : months_loan_duration > 27: no (6.9/2)
## credit_history = critical:
## :...years_at_residence <= 1: no (6.7)
## : years_at_residence > 1:
## : :...purpose in {car0,car,business,renovations}: no (62.2/21.9)
## : purpose = education: yes (7.9/0.9)
## : purpose = furniture/appliances:
## : :...phone = yes: no (14.5/2.8)
## : phone = no:
## : :...amount <= 1175: no (5.2)
## : amount > 1175: yes (30.1/7.6)
## credit_history = good:
## :...savings_balance = 100 - 500 DM: yes (32.1/11.7)
## savings_balance in {500 - 1000 DM,> 1000 DM}: no (15.7/4.7)
## savings_balance = unknown:
## :...job = unskilled: no (4.4)
## : job in {management,skilled,unemployed}:
## : :...checking_balance = 1 - 200 DM: no (26.8/10.4)
## : checking_balance = < 0 DM: yes (27.8/6)
## savings_balance = < 100 DM:
## :...dependents > 1:
## :...existing_loans_count > 1: no (2.6/0.4)
## : existing_loans_count <= 1:
## : :...years_at_residence <= 2: yes (10.2/2.9)
## : years_at_residence > 2: no (20.4/5.9)
## dependents <= 1:
## :...purpose in {car0,business}: no (9.7/2.5)
## purpose in {education,renovations}: yes (13/5.1)
## purpose = car:
## :...employment_duration in {unemployed,
## : : 1 - 4 years}: no (24.9/9)
## : employment_duration in {> 7 years,4 - 7 years,
## : < 1 year}: yes (32/8.3)
## purpose = furniture/appliances:
## :...months_loan_duration > 39: yes (4.8)
## months_loan_duration <= 39:
## :...phone = yes: yes (21.9/9.2)
## phone = no:
## :...employment_duration = unemployed: yes (3.3/0.4)
## employment_duration in {> 7 years,4 - 7 years,
## : < 1 year}: no (34.1/8.1)
## employment_duration = 1 - 4 years:
## :...percent_of_income <= 1: yes (3.8)
## percent_of_income > 1:
## :...months_loan_duration > 21: no (4.9/0.4)
## months_loan_duration <= 21:
## :...years_at_residence <= 3: no (20.9/8.8)
## years_at_residence > 3: yes (5.8)
##
## ----- Trial 5: -----
##
## Decision tree:
##
## checking_balance = unknown:
## :...other_credit = store: yes (16.9/7.5)
## : other_credit = bank:
## : :...housing = other: no (8.3/1.8)
## : : housing = rent: yes (4.4/0.8)
## : : housing = own:
## : : :...phone = yes: yes (12.1/5)
## : : phone = no: no (26.9/9.7)
## : other_credit = none:
## : :...credit_history in {critical,perfect,very good}: no (60.4/5.1)
## : credit_history in {poor,good}:
## : :...purpose in {car0,car,business,education}: no (53.6/12.8)
## : purpose = renovations: yes (7.3/1.1)
## : purpose = furniture/appliances:
## : :...job = unemployed: no (0)
## : job in {management,unskilled}: yes (19.2/7)
## : job = skilled:
## : :...phone = yes: no (14.6/1.8)
## : phone = no:
## : :...age > 32: no (9.2)
## : age <= 32:
## : :...employment_duration = 1 - 4 years: no (4.1)
## : employment_duration in {unemployed,> 7 years,
## : : 4 - 7 years,< 1 year}:
## : :...savings_balance in {100 - 500 DM,
## : : < 100 DM}: yes (20.5/3)
## : savings_balance in {unknown,500 - 1000 DM,
## : > 1000 DM}: no (3.4)
## checking_balance in {1 - 200 DM,> 200 DM,< 0 DM}:
## :...percent_of_income <= 2:
## :...amount > 11054: yes (14.2/1.2)
## : amount <= 11054:
## : :...other_credit = bank: no (32.3/9.7)
## : other_credit = store: yes (8.9/2.6)
## : other_credit = none:
## : :...purpose in {car0,education}: no (8.4/3.7)
## : purpose in {business,renovations}: yes (20.3/9.1)
## : purpose = car:
## : :...savings_balance = 100 - 500 DM: yes (13.8/3.3)
## : : savings_balance in {unknown,500 - 1000 DM,< 100 DM,
## : : > 1000 DM}: no (46.6/7.9)
## : purpose = furniture/appliances:
## : :...employment_duration in {unemployed,
## : : 1 - 4 years}: yes (50.8/19.5)
## : employment_duration in {> 7 years,
## : : 4 - 7 years}: no (18.2/2.6)
## : employment_duration = < 1 year:
## : :...job in {management,skilled,unemployed}: no (16.3/2.9)
## : job = unskilled: yes (6/1.6)
## percent_of_income > 2:
## :...years_at_residence <= 1:
## :...other_credit in {bank,store}: no (7.6)
## : other_credit = none:
## : :...months_loan_duration > 42: no (2.9)
## : months_loan_duration <= 42:
## : :...age <= 36: no (26.6/8.4)
## : age > 36: yes (5.3)
## years_at_residence > 1:
## :...job = unemployed: no (5.2)
## job in {management,skilled,unskilled}:
## :...credit_history = perfect: yes (10.9)
## credit_history in {poor,critical,good,very good}:
## :...employment_duration = < 1 year:
## :...checking_balance = > 200 DM: no (2.7)
## : checking_balance in {1 - 200 DM,< 0 DM}:
## : :...months_loan_duration > 21: yes (23.4/0.7)
## : months_loan_duration <= 21:
## : :...amount <= 1928: yes (18.4/4.4)
## : amount > 1928: no (4.5)
## employment_duration in {unemployed,> 7 years,4 - 7 years,
## : 1 - 4 years}:
## :...months_loan_duration <= 11:
## :...age > 47: no (12.2)
## : age <= 47:
## : :...purpose in {car0,car,furniture/appliances,
## : : business,renovations}: no (25/9.2)
## : purpose = education: yes (3.5)
## months_loan_duration > 11:
## :...savings_balance in {100 - 500 DM,> 1000 DM}:
## :...age <= 58: no (22.7/3.4)
## : age > 58: yes (4.4)
## savings_balance in {unknown,500 - 1000 DM,< 100 DM}:
## :...years_at_residence <= 2: yes (76.1/22.8)
## years_at_residence > 2:
## :...purpose in {car0,business,
## : education}: yes (24.7/7.1)
## purpose = renovations: no (1.1)
## purpose = furniture/appliances: [S1]
## purpose = car:
## :...amount <= 1388: yes (17.8/2.2)
## amount > 1388:
## :...housing = own: no (10.9)
## housing in {other,rent}: [S2]
##
## SubTree [S1]
##
## employment_duration = unemployed: no (4.4)
## employment_duration in {> 7 years,4 - 7 years,1 - 4 years}:
## :...checking_balance in {1 - 200 DM,> 200 DM}: no (29/10.5)
## checking_balance = < 0 DM: yes (35.6/12.4)
##
## SubTree [S2]
##
## savings_balance = unknown: no (6.8/1.5)
## savings_balance in {500 - 1000 DM,< 100 DM}: yes (21.4/6.4)
##
## ----- Trial 6: -----
##
## Decision tree:
##
## checking_balance in {unknown,> 200 DM}:
## :...purpose = car0: no (2.2)
## : purpose = renovations: yes (8.4/3.3)
## : purpose = education:
## : :...age <= 44: yes (19.8/7.7)
## : : age > 44: no (4.4)
## : purpose = car:
## : :...job in {management,unemployed}: no (20.8/1.6)
## : : job = unskilled:
## : : :...years_at_residence <= 3: no (11/1.3)
## : : : years_at_residence > 3: yes (14.5/3.2)
## : : job = skilled:
## : : :...other_credit in {bank,store}: yes (17.6/4.9)
## : : other_credit = none:
## : : :...existing_loans_count <= 2: no (24.6)
## : : existing_loans_count > 2: yes (2.4/0.3)
## : purpose = business:
## : :...existing_loans_count > 2: yes (3.3)
## : : existing_loans_count <= 2:
## : : :...amount <= 1823: no (8.1)
## : : amount > 1823:
## : : :...percent_of_income <= 3: no (12.1/3.3)
## : : percent_of_income > 3: yes (13.2/3.4)
## : purpose = furniture/appliances:
## : :...age > 44: no (22.7)
## : age <= 44:
## : :...job = unemployed: no (0)
## : job = unskilled:
## : :...existing_loans_count <= 1: yes (20.9/5.6)
## : : existing_loans_count > 1: no (4.5)
## : job in {management,skilled}:
## : :...dependents > 1: no (6.6)
## : dependents <= 1:
## : :...existing_loans_count <= 1:
## : :...savings_balance in {100 - 500 DM,unknown,500 - 1000 DM,
## : : : > 1000 DM}: no (16.9)
## : : savings_balance = < 100 DM:
## : : :...age <= 22: yes (8.5/1.3)
## : : age > 22: no (43.1/8.8)
## : existing_loans_count > 1:
## : :...housing in {other,rent}: yes (9.9/2.1)
## : housing = own:
## : :...credit_history in {poor,critical,
## : : very good}: no (18.6/1.6)
## : credit_history in {good,perfect}: yes (14.9/4.3)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...credit_history = perfect: yes (28.1/9.6)
## credit_history = very good:
## :...age <= 23: no (5.5)
## : age > 23: yes (30/8.1)
## credit_history = poor:
## :...percent_of_income <= 1: no (6.5)
## : percent_of_income > 1:
## : :...savings_balance in {unknown,500 - 1000 DM}: no (6.4)
## : savings_balance in {100 - 500 DM,< 100 DM,> 1000 DM}:
## : :...dependents <= 1: yes (25.1/8)
## : dependents > 1: no (5/0.9)
## credit_history = critical:
## :...savings_balance = unknown: no (8.4)
## : savings_balance in {100 - 500 DM,500 - 1000 DM,< 100 DM,> 1000 DM}:
## : :...other_credit = bank: yes (16.2/4.3)
## : other_credit = store: no (3.7/0.9)
## : other_credit = none:
## : :...savings_balance = 100 - 500 DM: no (5.9)
## : savings_balance in {500 - 1000 DM,> 1000 DM}: yes (7.3/2.3)
## : savings_balance = < 100 DM:
## : :...purpose in {car0,education,renovations}: yes (8.5/2.2)
## : purpose = business: no (4.5/2.2)
## : purpose = car:
## : :...age <= 29: yes (6.9)
## : : age > 29: no (25.6/6.9)
## : purpose = furniture/appliances:
## : :...months_loan_duration <= 36: no (38.4/10.9)
## : months_loan_duration > 36: yes (3.8)
## credit_history = good:
## :...amount > 8086: yes (24/3.8)
## amount <= 8086:
## :...phone = yes:
## :...age <= 28: yes (23.9/7.5)
## : age > 28: no (69.4/17.9)
## phone = no:
## :...other_credit in {bank,store}: yes (25.1/7.2)
## other_credit = none:
## :...percent_of_income <= 2:
## :...job in {management,unskilled,unemployed}: no (15.6/2.7)
## : job = skilled:
## : :...amount <= 1386: yes (9.9/1)
## : amount > 1386:
## : :...age <= 24: yes (13.4/4.6)
## : age > 24: no (27.8/3.1)
## percent_of_income > 2:
## :...checking_balance = < 0 DM: yes (62.5/21.4)
## checking_balance = 1 - 200 DM:
## :...months_loan_duration > 42: yes (4.9)
## months_loan_duration <= 42:
## :...existing_loans_count > 1: no (5)
## existing_loans_count <= 1:
## :...age <= 35: no (39.4/13.2)
## age > 35: yes (14.7/4.2)
##
## ----- Trial 7: -----
##
## Decision tree:
##
## checking_balance = unknown:
## :...employment_duration = unemployed: yes (16.6/8)
## : employment_duration in {> 7 years,4 - 7 years}: no (101.1/20.4)
## : employment_duration = < 1 year:
## : :...amount <= 4594: no (30/5.7)
## : : amount > 4594: yes (10.6/0.3)
## : employment_duration = 1 - 4 years:
## : :...dependents > 1: no (8)
## : dependents <= 1:
## : :...months_loan_duration <= 16: no (32.8/5.3)
## : months_loan_duration > 16:
## : :...existing_loans_count > 2: yes (2.7)
## : existing_loans_count <= 2:
## : :...percent_of_income <= 3: no (20.9/5.9)
## : percent_of_income > 3:
## : :...purpose in {car,furniture/appliances,
## : : renovations}: no (19.7/7.5)
## : purpose in {car0,business,education}: yes (10.8)
## checking_balance in {1 - 200 DM,> 200 DM,< 0 DM}:
## :...purpose in {car0,education,renovations}: no (67.2/29.2)
## purpose = car:
## :...amount <= 1297: yes (52.4/12.9)
## : amount > 1297:
## : :...percent_of_income <= 2:
## : :...phone = no: no (32.7/6.1)
## : : phone = yes:
## : : :...years_at_residence <= 3: no (20/4.9)
## : : years_at_residence > 3: yes (14.7/3.8)
## : percent_of_income > 2:
## : :...percent_of_income <= 3: yes (33.1/11.3)
## : percent_of_income > 3:
## : :...months_loan_duration <= 18: no (18.2/1.6)
## : months_loan_duration > 18:
## : :...existing_loans_count <= 1: no (19.5/7.2)
## : existing_loans_count > 1: yes (13.8/1)
## purpose = business:
## :...age > 46: yes (5.2)
## : age <= 46:
## : :...months_loan_duration <= 18: no (17.5)
## : months_loan_duration > 18:
## : :...other_credit in {bank,store}: no (10/0.5)
## : other_credit = none:
## : :...employment_duration in {unemployed,
## : : > 7 years}: yes (6.6)
## : employment_duration in {4 - 7 years,1 - 4 years,< 1 year}:
## : :...age <= 25: yes (4)
## : age > 25: no (19.2/5.6)
## purpose = furniture/appliances:
## :...savings_balance = 100 - 500 DM: yes (18.6/6)
## savings_balance = > 1000 DM: no (5.2)
## savings_balance in {unknown,500 - 1000 DM,< 100 DM}:
## :...existing_loans_count > 1:
## :...existing_loans_count > 2: no (3.6)
## : existing_loans_count <= 2:
## : :...housing = other: yes (3.3)
## : housing in {own,rent}:
## : :...savings_balance = unknown: no (6.9)
## : savings_balance = 500 - 1000 DM: yes (3.5/1)
## : savings_balance = < 100 DM:
## : :...age > 54: yes (2.1)
## : age <= 54: [S1]
## existing_loans_count <= 1:
## :...credit_history in {poor,very good}: no (20.8/9.5)
## credit_history in {critical,perfect}: yes (20.3/7.6)
## credit_history = good:
## :...months_loan_duration <= 7: no (11.4)
## months_loan_duration > 7:
## :...other_credit = bank: no (14.2/4.6)
## other_credit = store: yes (11.7/3.9)
## other_credit = none:
## :...percent_of_income <= 1: no (20.5/5.2)
## percent_of_income > 1:
## :...amount > 6078: yes (10.9/1.1)
## amount <= 6078:
## :...dependents > 1: yes (8.7/2.5)
## dependents <= 1: [S2]
##
## SubTree [S1]
##
## employment_duration in {unemployed,> 7 years,1 - 4 years}: no (25.7/2.9)
## employment_duration in {4 - 7 years,< 1 year}: yes (15/2.5)
##
## SubTree [S2]
##
## employment_duration = > 7 years: no (17.9/2.5)
## employment_duration in {unemployed,4 - 7 years,1 - 4 years,< 1 year}:
## :...job = management: no (6.6)
## job = unemployed: yes (1.1)
## job in {skilled,unskilled}:
## :...years_at_residence <= 1: no (11.8/1.8)
## years_at_residence > 1:
## :...checking_balance = 1 - 200 DM: yes (25.1/8.8)
## checking_balance = > 200 DM: no (14.7/6.3)
## checking_balance = < 0 DM:
## :...months_loan_duration <= 16: no (13.8/3.4)
## months_loan_duration > 16: yes (19.1/5.5)
##
## ----- Trial 8: -----
##
## Decision tree:
##
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...credit_history = perfect:
## : :...housing in {other,rent}: yes (8.3)
## : : housing = own:
## : : :...age <= 34: no (16.6/4.7)
## : : age > 34: yes (5.8)
## : credit_history = poor:
## : :...checking_balance = < 0 DM: yes (12/2.7)
## : : checking_balance = 1 - 200 DM:
## : : :...housing = rent: no (8.6)
## : : housing in {other,own}:
## : : :...amount <= 2279: yes (6.8/0.6)
## : : amount > 2279: no (20/5.7)
## : credit_history = very good:
## : :...existing_loans_count > 1: yes (2.5)
## : : existing_loans_count <= 1:
## : : :...age <= 23: no (3.7)
## : : age > 23:
## : : :...amount <= 8386: yes (32.9/8.1)
## : : amount > 8386: no (2.5)
## : credit_history = critical:
## : :...years_at_residence <= 1: no (8)
## : : years_at_residence > 1:
## : : :...savings_balance in {100 - 500 DM,unknown,500 - 1000 DM,
## : : : > 1000 DM}: no (25.5/5.7)
## : : savings_balance = < 100 DM:
## : : :...age > 61: no (6)
## : : age <= 61:
## : : :...existing_loans_count > 2: no (10.7/2.4)
## : : existing_loans_count <= 2:
## : : :...age > 56: yes (5.4)
## : : age <= 56:
## : : :...amount > 2483: yes (34.1/8.9)
## : : amount <= 2483:
## : : :...purpose in {car0,car,furniture/appliances,
## : : : renovations}: no (41.4/10.8)
## : : purpose in {business,education}: yes (4.4)
## : credit_history = good:
## : :...amount > 8086: yes (26.6/4.8)
## : amount <= 8086:
## : :...savings_balance in {500 - 1000 DM,> 1000 DM}: no (17.5/5.1)
## : savings_balance = 100 - 500 DM:
## : :...months_loan_duration <= 27: no (21.3/7.1)
## : : months_loan_duration > 27: yes (5.1)
## : savings_balance = unknown:
## : :...age <= 56: yes (44.7/16.9)
## : : age > 56: no (4.4)
## : savings_balance = < 100 DM:
## : :...job = unemployed: yes (0.9)
## : job = management:
## : :...employment_duration in {unemployed,4 - 7 years,1 - 4 years,
## : : : < 1 year}: no (17.3/1.6)
## : : employment_duration = > 7 years: yes (8/1.2)
## : job = unskilled:
## : :...months_loan_duration <= 26: no (59/19.7)
## : : months_loan_duration > 26: yes (3.3)
## : job = skilled:
## : :...purpose in {car0,business,education,
## : : renovations}: yes (16.6/4.1)
## : purpose = car:
## : :...dependents <= 1: yes (27.7/10.6)
## : : dependents > 1: no (8.1/1.4)
## : purpose = furniture/appliances:
## : :...years_at_residence <= 1: no (18.7/6.5)
## : years_at_residence > 1:
## : :...other_credit = bank: yes (4.5)
## : other_credit = store: no (2.3)
## : other_credit = none:
## : :...percent_of_income <= 3: yes (33.5/15)
## : percent_of_income > 3: no (27.3/9.3)
## checking_balance in {unknown,> 200 DM}:
## :...years_at_residence > 2: no (135.6/32.2)
## years_at_residence <= 2:
## :...months_loan_duration <= 8: no (12.9)
## months_loan_duration > 8:
## :...months_loan_duration <= 9: yes (10.4/1.3)
## months_loan_duration > 9:
## :...months_loan_duration <= 16: no (31.3/4.2)
## months_loan_duration > 16:
## :...purpose in {car0,business,renovations}: no (21.3/8.4)
## purpose = education: yes (6.3/0.8)
## purpose = car:
## :...credit_history in {poor,good,perfect}: no (9.6)
## : credit_history in {critical,very good}: yes (17.3/2.6)
## purpose = furniture/appliances:
## :...credit_history = poor: yes (4.9)
## credit_history in {critical,perfect,
## : very good}: no (5.6)
## credit_history = good:
## :...housing in {other,rent}: no (2.6)
## housing = own:
## :...age <= 25: no (6.8)
## age > 25: yes (29.2/10.2)
##
## ----- Trial 9: -----
##
## Decision tree:
##
## checking_balance = unknown:
## :...dependents > 1: no (26)
## : dependents <= 1:
## : :...amount <= 1474: no (39.7)
## : amount > 1474:
## : :...employment_duration in {> 7 years,4 - 7 years}:
## : :...years_at_residence > 2: no (21.8)
## : : years_at_residence <= 2:
## : : :...age <= 23: yes (4.1)
## : : age > 23: no (19.7/4.2)
## : employment_duration in {unemployed,1 - 4 years,< 1 year}:
## : :...purpose in {business,renovations}: yes (23.2/3.6)
## : purpose in {car0,car,furniture/appliances,education}:
## : :...other_credit in {bank,store}: yes (29.1/10.5)
## : other_credit = none:
## : :...purpose in {car0,car}: no (12.3)
## : purpose in {furniture/appliances,education}:
## : :...amount <= 4455: no (23.7/4.4)
## : amount > 4455: yes (11.1/1.3)
## checking_balance in {1 - 200 DM,> 200 DM,< 0 DM}:
## :...percent_of_income <= 2:
## :...amount > 11054: yes (15.7/3.6)
## : amount <= 11054:
## : :...savings_balance in {unknown,500 - 1000 DM,
## : : > 1000 DM}: no (41.5/11.2)
## : savings_balance = 100 - 500 DM:
## : :...other_credit in {none,store}: yes (21.7/9.4)
## : : other_credit = bank: no (5.1)
## : savings_balance = < 100 DM:
## : :...employment_duration in {unemployed,> 7 years}: no (34.6/11.5)
## : employment_duration = 1 - 4 years:
## : :...job = management: yes (5.1/0.8)
## : : job in {skilled,unskilled,unemployed}: no (65.4/15.8)
## : employment_duration = 4 - 7 years:
## : :...dependents > 1: no (4.6)
## : : dependents <= 1:
## : : :...amount <= 6527: no (16.8/7.2)
## : : amount > 6527: yes (7)
## : employment_duration = < 1 year:
## : :...amount <= 2327:
## : :...age <= 34: yes (20.5/1.9)
## : : age > 34: no (3)
## : amount > 2327:
## : :...other_credit in {none,store}: no (20.1/3.9)
## : other_credit = bank: yes (2.8)
## percent_of_income > 2:
## :...housing = rent:
## :...checking_balance in {1 - 200 DM,< 0 DM}: yes (69/22.1)
## : checking_balance = > 200 DM: no (3.4)
## housing = other:
## :...existing_loans_count > 1: yes (18.7/5.3)
## : existing_loans_count <= 1:
## : :...savings_balance in {100 - 500 DM,unknown}: no (15.3/3.2)
## : savings_balance in {500 - 1000 DM,< 100 DM,
## : > 1000 DM}: yes (29.1/8.6)
## housing = own:
## :...credit_history in {poor,perfect}: yes (26.9/7.4)
## credit_history = very good: no (14.9/5.6)
## credit_history = critical:
## :...other_credit in {none,store}: no (63/20.3)
## : other_credit = bank: yes (11.7/3.4)
## credit_history = good:
## :...other_credit = store: yes (8.9/1.4)
## other_credit in {none,bank}:
## :...age > 54: no (9.5)
## age <= 54:
## :...existing_loans_count > 1: no (10.2/2.7)
## existing_loans_count <= 1:
## :...purpose in {business,renovations}: no (10.1/3.6)
## purpose in {car0,education}: yes (4.7)
## purpose = car:
## :...other_credit = bank: yes (4.9)
## : other_credit = none:
## : :...years_at_residence > 2: no (14.8/4.5)
## : years_at_residence <= 2:
## : :...amount <= 2150: no (14.9/6.2)
## : amount > 2150: yes (11.1)
## purpose = furniture/appliances:
## :...savings_balance = 100 - 500 DM: yes (3.8)
## savings_balance in {500 - 1000 DM,
## : > 1000 DM}: no (2.8)
## savings_balance in {unknown,< 100 DM}:
## :...months_loan_duration > 39: yes (3.3)
## months_loan_duration <= 39:
## :...dependents <= 1: no (57.6/19.4)
## dependents > 1: yes (4.6/1.1)
##
##
## Evaluation on training data (900 cases):
##
## Trial Decision Tree
## ----- ----------------
## Size Errors
##
## 0 56 133(14.8%)
## 1 34 211(23.4%)
## 2 39 201(22.3%)
## 3 47 179(19.9%)
## 4 46 174(19.3%)
## 5 50 197(21.9%)
## 6 55 187(20.8%)
## 7 50 190(21.1%)
## 8 51 192(21.3%)
## 9 47 169(18.8%)
## boost 34( 3.8%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 629 4 (a): class no
## 30 237 (b): class yes
##
##
## Attribute usage:
##
## 100.00% checking_balance
## 100.00% purpose
## 97.11% years_at_residence
## 96.67% employment_duration
## 94.78% credit_history
## 94.67% other_credit
## 92.56% job
## 92.11% percent_of_income
## 90.33% amount
## 85.11% months_loan_duration
## 82.78% age
## 82.78% existing_loans_count
## 75.78% dependents
## 71.56% housing
## 70.78% savings_balance
## 49.22% phone
##
##
## Time: 0.0 secs
Then we can use the boosted model on our test dataset to see its performance.
credit_boost_pred10 <- predict(credit_boost10, credit_test)
CrossTable(credit_test$default, credit_boost_pred10,
prop.chisq = FALSE,
prop.c = FALSE,
prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  100
##
##                | predicted default
## actual default |        no |       yes | Row Total |
## ---------------|-----------|-----------|-----------|
##             no |        62 |         5 |        67 |
##                |     0.620 |     0.050 |           |
## ---------------|-----------|-----------|-----------|
##            yes |        13 |        20 |        33 |
##                |     0.130 |     0.200 |           |
## ---------------|-----------|-----------|-----------|
##   Column Total |        75 |        25 |       100 |
## ---------------|-----------|-----------|-----------|
##
Here, we reduced the total error rate from 27 percent prior to boosting down to 18 percent in the boosted model.
Giving a loan to an applicant who is likely to default can be an expensive mistake. One solution is to reduce the number of false negatives by specifying a cost matrix.
The C5.0 algorithm allows us to assign a penalty to different types of errors, in order to discourage a tree from making more costly mistakes. The penalties are designated in a cost matrix, which specifies how much costlier each error is, relative to any other prediction.
We start by specifying the dimensions of the cost matrix:
matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
names(matrix_dimensions) <- c("predicted", "actual")
matrix_dimensions
## $predicted
## [1] "no"  "yes"
##
## $actual
## [1] "no"  "yes"
Next, we need to assign the penalty for the various types of errors by supplying four values to fill the matrix. Because R fills a matrix column by column, the penalty/cost scores must be supplied in this order:
Predicted no, actual no
Predicted yes, actual no
Predicted no, actual yes
Predicted yes, actual yes
Suppose we believe that a loan default costs the bank four times as much as a missed opportunity. Our penalty values could then be defined as:
# build the matrix
error_cost <- matrix(c(0, 1, 4, 0),
nrow = 2,
dimnames = matrix_dimensions)
error_cost
##          actual
## predicted no yes
##       no   0   4
##       yes  1   0
To see how this impacts classification, let’s apply it to our decision tree using the costs parameter of the C5.0() function.
credit_cost <- C5.0(credit_train[-17], credit_train$default,
costs = error_cost)
credit_cost_pred <- predict(credit_cost, credit_test)
CrossTable(credit_test$default, credit_cost_pred,
prop.chisq = FALSE,
prop.c = FALSE,
prop.r = FALSE,
dnn = c('actual default', 'predicted default'))
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  100
##
##                | predicted default
## actual default |        no |       yes | Row Total |
## ---------------|-----------|-----------|-----------|
##             no |        37 |        30 |        67 |
##                |     0.370 |     0.300 |           |
## ---------------|-----------|-----------|-----------|
##            yes |         7 |        26 |        33 |
##                |     0.070 |     0.260 |           |
## ---------------|-----------|-----------|-----------|
##   Column Total |        44 |        56 |       100 |
## ---------------|-----------|-----------|-----------|
##
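The two cross tables can be compared directly. The cost-sensitive tree makes more mistakes overall (37 of 100 versus 18 of 100 for the boosted model), but it misses far fewer actual defaults (7 versus 13). The sketch below is only an illustration: it re-enters the test-set counts by hand from the two tables, and the helper `summarise_errors()` and the names `boost_counts` and `cost_counts` are hypothetical, not part of the C50 or gmodels packages.

```r
# Cost matrix from the text: rows = predicted, columns = actual
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = list(predicted = c("no", "yes"),
                                     actual    = c("no", "yes")))

# Test-set counts copied by hand from the two cross tables,
# arranged like error_cost (rows = predicted, columns = actual)
boost_counts <- matrix(c(62, 5, 13, 20), nrow = 2,
                       dimnames = dimnames(error_cost))
cost_counts  <- matrix(c(37, 30, 7, 26), nrow = 2,
                       dimnames = dimnames(error_cost))

# Hypothetical helper: overall error rate, missed defaults
# (predicted "no", actual "yes"), and total penalty under error_cost
summarise_errors <- function(counts, cost) {
  c(error_rate      = 1 - sum(diag(counts)) / sum(counts),
    missed_defaults = counts["no", "yes"],
    total_cost      = sum(counts * cost))
}

boost_summary <- summarise_errors(boost_counts, error_cost)
cost_summary  <- summarise_errors(cost_counts, error_cost)
rbind(boosted = boost_summary, cost_sensitive = cost_summary)
```

Under the assumed 4-to-1 penalty, the two models end up with nearly the same total penalty on this test set, so the choice between them depends on how confident we are in that cost ratio.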
Trees for numeric prediction fall into two categories.
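Regression trees were introduced in the 1980s as part of the seminal Classification and Regression Tree (CART) algorithm. Despite the name, they do not use linear regression; each prediction is the average value of the training examples that reach a leaf.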
Model trees are grown in much the same way as regression trees, but at each leaf, a multiple linear regression model is built from the examples reaching that node.
Although regression methods are typically the first choice for numeric prediction tasks, regression trees or model trees may be a better fit when the features have many complex, non-linear relationships with the outcome.
Regression also makes assumptions about how numeric data are distributed, and these assumptions are often violated in real-world data.
Winemaking is a challenging and competitive business that offers the potential for great profit. However, there are numerous factors that contribute to the profitability of a winery. Variables such as weather, the growing environment, the bottling, manufacturing, bottle design, or even price point, can affect the customer’s perception of taste.
More recently, machine learning has been employed to assist with rating the quality of wine—a notoriously difficult task. A review written by a renowned wine critic often determines whether the product ends up on the top or bottom shelf.
In this case study, we will use regression trees and model trees to create a system capable of mimicking expert ratings of wine. Computer-aided wine testing may therefore result in a better product as well as more objective, consistent, and fair ratings.
We will first download and import the whitewine dataset, which includes examples of white Vinho Verde wines from Portugal, one of the world’s leading wine-producing countries.
wine <- read.csv("winequality-white.csv")
The white wine data includes information on 11 chemical properties of 4,898 wine samples.
# examine the wine data
str(wine)
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
# the distribution of quality ratings
hist(wine$quality)
# summary statistics of the wine data
summary(wine)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0        Min.   :0.9871
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.9917
##  Median :0.04300   Median : 34.00      Median :134.0        Median :0.9937
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4        Mean   :0.9940
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.9961
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0        Max.   :1.0390
##        pH          sulphates         alcohol         quality
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000
Our last step is to divide the data into training and testing datasets. Since the wine dataset was already sorted into random order, we can partition it by row position: the first 3,750 rows (roughly 75 percent) for training and the remaining 1,148 rows (roughly 25 percent) for testing.
wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
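Splitting by row position is only safe because the rows are already in random order. If that were not the case, the rows would need to be drawn at random first. A minimal sketch of that alternative, using a small made-up data frame (`toy`) and an arbitrary seed rather than the wine data:

```r
# Toy data frame standing in for a dataset that is NOT already shuffled
# (hypothetical data, sorted by quality on purpose)
toy <- data.frame(id = 1:20, quality = rep(3:7, each = 4))

set.seed(123)  # arbitrary seed, used only to make the draw reproducible

# Draw 75% of the row indices at random, without replacement
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))

toy_train <- toy[train_idx, ]   # rows selected for training
toy_test  <- toy[-train_idx, ]  # all remaining rows go to testing

nrow(toy_train)
nrow(toy_test)
```

Negative indexing with `-train_idx` guarantees that no row ends up in both sets.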
We will use the rpart (recursive partitioning) package, which offers the most faithful implementation of regression trees as they were described by the CART team.
#install.packages("rpart")
library(rpart)
m.rpart <- rpart(quality ~ ., data = wine_train)
summary(m.rpart)
## Call:
## rpart(formula = quality ~ ., data = wine_train)
##   n= 3750
##
##           CP nsplit rel error    xerror       xstd
## 1 0.17816211      0 1.0000000 1.0010391 0.02390494
## 2 0.04439109      1 0.8218379 0.8232523 0.02238387
## 3 0.02890893      2 0.7774468 0.7885622 0.02218078
## 4 0.01655575      3 0.7485379 0.7610867 0.02103390
## 5 0.01108600      4 0.7319821 0.7498581 0.02060005
## 6 0.01000000      5 0.7208961 0.7434623 0.02026718
##
## Variable importance
##              alcohol              density            chlorides
##                   38                   23                   12
##     volatile.acidity total.sulfur.dioxide  free.sulfur.dioxide
##                   12                    7                    6
##            sulphates                   pH       residual.sugar
##                    1                    1                    1
##
## Node number 1: 3750 observations,    complexity param=0.1781621
##   mean=5.886933, MSE=0.8373493
##   left son=2 (2473 obs) right son=3 (1277 obs)
##   Primary splits:
##       alcohol              < 10.85    to the left,  improve=0.17816210, (0 missing)
##       density              < 0.992385 to the right, improve=0.11980970, (0 missing)
##       chlorides            < 0.0395   to the right, improve=0.08199995, (0 missing)
##       total.sulfur.dioxide < 153.5    to the right, improve=0.03875440, (0 missing)
##       free.sulfur.dioxide  < 11.75    to the left,  improve=0.03632119, (0 missing)
##   Surrogate splits:
##       density              < 0.99201  to the right, agree=0.869, adj=0.614, (0 split)
##       chlorides            < 0.0375   to the right, agree=0.773, adj=0.334, (0 split)
##       total.sulfur.dioxide < 102.5    to the right, agree=0.705, adj=0.132, (0 split)
##       sulphates            < 0.345    to the right, agree=0.670, adj=0.031, (0 split)
##       fixed.acidity        < 5.25     to the right, agree=0.662, adj=0.009, (0 split)
##
## Node number 2: 2473 observations,    complexity param=0.04439109
##   mean=5.609381, MSE=0.6108623
##   left son=4 (1406 obs) right son=5 (1067 obs)
##   Primary splits:
##       volatile.acidity    < 0.2425   to the right, improve=0.09227123, (0 missing)
##       free.sulfur.dioxide < 13.5     to the left,  improve=0.04177240, (0 missing)
##       alcohol             < 10.15    to the left,  improve=0.03313802, (0 missing)
##       citric.acid         < 0.205    to the left,  improve=0.02721200, (0 missing)
##       pH                  < 3.325    to the left,  improve=0.01860335, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 111.5    to the right, agree=0.610, adj=0.097, (0 split)
##       pH                   < 3.295    to the left,  agree=0.598, adj=0.067, (0 split)
##       alcohol              < 10.05    to the left,  agree=0.590, adj=0.049, (0 split)
##       sulphates            < 0.715    to the left,  agree=0.584, adj=0.037, (0 split)
##       residual.sugar       < 1.85     to the right, agree=0.581, adj=0.029, (0 split)
##
## Node number 3: 1277 observations,    complexity param=0.02890893
##   mean=6.424432, MSE=0.8378682
##   left son=6 (93 obs) right son=7 (1184 obs)
##   Primary splits:
##       free.sulfur.dioxide  < 11.5     to the left,  improve=0.08484051, (0 missing)
##       alcohol              < 11.85    to the left,  improve=0.06149941, (0 missing)
##       fixed.acidity        < 7.35     to the right, improve=0.04259695, (0 missing)
##       residual.sugar       < 1.275    to the left,  improve=0.02795662, (0 missing)
##       total.sulfur.dioxide < 67.5     to the left,  improve=0.02541719, (0 missing)
##   Surrogate splits:
##       total.sulfur.dioxide < 48.5     to the left,  agree=0.937, adj=0.14, (0 split)
##
## Node number 4: 1406 observations,    complexity param=0.011086
##   mean=5.40256, MSE=0.526423
##   left son=8 (182 obs) right son=9 (1224 obs)
##   Primary splits:
##       volatile.acidity     < 0.4225   to the right, improve=0.04703189, (0 missing)
##       free.sulfur.dioxide  < 17.5     to the left,  improve=0.04607770, (0 missing)
##       total.sulfur.dioxide < 86.5     to the left,  improve=0.02894310, (0 missing)
##       alcohol              < 10.25    to the left,  improve=0.02890077, (0 missing)
##       chlorides            < 0.0455   to the right, improve=0.02096635, (0 missing)
##   Surrogate splits:
##       density       < 0.99107  to the left,  agree=0.874, adj=0.027, (0 split)
##       citric.acid   < 0.11     to the left,  agree=0.873, adj=0.022, (0 split)
##       fixed.acidity < 9.85     to the right, agree=0.873, adj=0.016, (0 split)
##       chlorides     < 0.206    to the right, agree=0.871, adj=0.005, (0 split)
##
## Node number 5: 1067 observations
##   mean=5.881912, MSE=0.591491
##
## Node number 6: 93 observations
##   mean=5.473118, MSE=1.066482
##
## Node number 7: 1184 observations,    complexity param=0.01655575
##   mean=6.499155, MSE=0.7432425
##   left son=14 (611 obs) right son=15 (573 obs)
##   Primary splits:
##       alcohol        < 11.85    to the left,  improve=0.05907511, (0 missing)
##       fixed.acidity  < 7.35     to the right, improve=0.04400660, (0 missing)
##       density        < 0.991395 to the right, improve=0.02522410, (0 missing)
##       residual.sugar < 1.225    to the left,  improve=0.02503936, (0 missing)
##       pH             < 3.245    to the left,  improve=0.02417936, (0 missing)
##   Surrogate splits:
##       density              < 0.991115 to the right, agree=0.710, adj=0.401, (0 split)
##       volatile.acidity     < 0.2675   to the left,  agree=0.665, adj=0.307, (0 split)
##       chlorides            < 0.0365   to the right, agree=0.631, adj=0.237, (0 split)
##       total.sulfur.dioxide < 126.5    to the right, agree=0.566, adj=0.103, (0 split)
##       residual.sugar       < 1.525    to the left,  agree=0.560, adj=0.091, (0 split)
##
## Node number 8: 182 observations
##   mean=4.994505, MSE=0.5109588
##
## Node number 9: 1224 observations
##   mean=5.463235, MSE=0.5002823
##
## Node number 14: 611 observations
##   mean=6.296236, MSE=0.7322117
##
## Node number 15: 573 observations
##   mean=6.715532, MSE=0.6642788
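The `improve` figure reported for each split can be checked by hand. The sketch below assumes that, for a regression tree, `improve` is the fraction of the parent node's sum of squared errors removed by the split. It uses only the node sizes and MSE values printed in the summary above for nodes 1, 2, and 3; the variable names are illustrative.

```r
# Figures copied from the summary output
n_root  <- 3750; mse_root  <- 0.8373493  # node 1 (all training wines)
n_left  <- 2473; mse_left  <- 0.6108623  # node 2 (alcohol < 10.85)
n_right <- 1277; mse_right <- 0.8378682  # node 3 (alcohol >= 10.85)

# Sum of squared errors = number of observations * mean squared error
ss_root     <- n_root * mse_root
ss_children <- n_left * mse_left + n_right * mse_right

# Assumed definition: relative reduction in the sum of squares
improve <- 1 - ss_children / ss_root
improve
```

The result agrees with the `improve=0.17816210` reported for the alcohol split at node 1, and with the first CP value in the table, up to the rounding of the printed MSE values.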
Although the tree can be understood using only the preceding output, it is often more readily understood using visualization. The rpart.plot package by Stephen Milborrow provides an easy-to-use function for visualization.
#install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(m.rpart, digits = 3)
# a few adjustments to the diagram
rpart.plot(m.rpart, digits = 4,
fallen.leaves = TRUE,
type = 3, extra = 101)
To use the regression tree model to make predictions on the test data, we use the predict() function.
# generate predictions for the testing dataset
p.rpart <- predict(m.rpart, wine_test)
# compare the distribution of predicted values vs. actual values
summary(p.rpart)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   4.995   5.463   5.882   5.999   6.296   6.716
summary(wine_test$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.000   5.000   6.000   5.848   6.000   8.000
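The predictions span a much narrower range (about 5.0 to 6.7) than the actual ratings (3 to 8). This finding suggests that the model is not correctly identifying the extreme cases, in particular the best and worst wines. Between the first and third quartiles, however, it may be doing well.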
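The narrow range follows from how a regression tree predicts: every wine that lands in a leaf receives that leaf's mean quality, so no prediction can fall outside the smallest and largest leaf means. A small sketch, with the six leaf means copied by hand from the `summary(m.rpart)` output (the vector name `leaf_means` is illustrative):

```r
# Mean quality at each terminal node, copied from the summary output
leaf_means <- c(node5  = 5.881912,
                node6  = 5.473118,
                node8  = 4.994505,
                node9  = 5.463235,
                node14 = 6.296236,
                node15 = 6.715532)

# The tree can only ever predict one of these six values
sort(leaf_means)

# So every prediction lies between the lowest and highest leaf mean
prediction_range <- range(leaf_means)
prediction_range
```

These bounds match the minimum and maximum of `summary(p.rpart)`, which is why wines rated 3 or 8 can never be predicted exactly by this tree.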
Another way to think about the model’s performance is to consider how far, on average, its prediction was from the true value. This measurement is called the mean absolute error (MAE).
# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))
}
# mean absolute error between actual and predicted values
MAE(wine_test$quality, p.rpart)
## [1] 0.5732104
On average, the model's prediction differs from the true quality score by about 0.57. On a quality scale from zero to 10, this seems to suggest that our model is doing fairly well.
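An MAE value is easier to judge against a reference point. One common reference, shown here as an assumption rather than something computed on the wine data, is a model that always predicts the mean rating. The sketch uses four made-up ratings and predictions, so the numbers are illustrative only:

```r
# Same definition as in the text
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Hypothetical ratings and model predictions (not from the wine data)
actual    <- c(6, 5, 7, 4)
predicted <- c(5.5, 5.0, 6.0, 5.0)

# Model error: absolute differences are 0.5, 0, 1, 1
model_mae <- MAE(actual, predicted)

# Baseline error: always predict the mean rating (5.5 here),
# giving absolute differences of 0.5, 0.5, 1.5, 1.5
baseline_mae <- MAE(actual, mean(actual))

c(model = model_mae, baseline = baseline_mae)
```

A model is only useful if its MAE is clearly below that of such a baseline; the same comparison could be run on `wine_test$quality`.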