Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data is
provided in that package.
library(caret) #this package contains the german data with its numeric format
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## Loading required package: lattice
data(GermanCredit)
#code the Class variable as 1 and 0: 1 means Good and 0 means Bad.
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good")
GermanCredit$Class <- as.factor(GermanCredit$Class)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
#load tree model packages
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
#This code drops variables that provide no useful information in the data
#Only run this code chunk ONCE.
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
2024 for reproducibility. (5pts)
set.seed(2024)
train_index <- sample(1:nrow(GermanCredit), 0.8 * nrow(GermanCredit))
train.df <- GermanCredit[train_index, ]
test.df <- GermanCredit[-train_index, ]
Class) are right. (10pts)
train.df$Class <- as.factor(train.df$Class)
test.df$Class <- as.factor(test.df$Class)
tree.model <- rpart(Class ~ ., data = train.df, method = "class")
summary(tree.model)
## Call:
## rpart(formula = Class ~ ., data = train.df, method = "class")
## n= 800
##
## CP nsplit rel error xerror xstd
## 1 0.03056769 0 1.0000000 1.000000 0.05582842
## 2 0.02401747 4 0.8777293 1.013100 0.05604510
## 3 0.01965066 8 0.7729258 1.004367 0.05590117
## 4 0.01310044 10 0.7336245 1.017467 0.05611630
## 5 0.01091703 19 0.6069869 1.013100 0.05604510
## 6 0.01000000 21 0.5851528 1.008734 0.05597339
##
## Variable importance
## Amount Duration
## 15 14
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 14 14
## Age SavingsAccountBonds.lt.100
## 7 4
## Purpose.UsedCar Job.SkilledEmployee
## 4 3
## InstallmentRatePercentage Property.RealEstate
## 3 3
## SavingsAccountBonds.100.to.500 OtherInstallmentPlans.Stores
## 3 2
## NumberExistingCredits Purpose.Business
## 2 2
## Job.UnskilledResident EmploymentDuration.lt.1
## 1 1
## OtherDebtorsGuarantors.None Telephone
## 1 1
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 1 1
## Personal.Male.Divorced.Seperated Property.Insurance
## 1 1
## Purpose.Furniture.Equipment
## 1
##
## Node number 1: 800 observations, complexity param=0.03056769
## predicted class=1 expected loss=0.28625 P(node) =1
## class counts: 229 571
## probabilities: 0.286 0.714
## left son=2 (211 obs) right son=3 (589 obs)
## Primary splits:
## CheckingAccountStatus.lt.0 < 0.5 to the right, improve=21.222720, (0 missing)
## Duration < 25.5 to the right, improve=13.584620, (0 missing)
## Amount < 10918 to the right, improve=12.537530, (0 missing)
## SavingsAccountBonds.lt.100 < 0.5 to the right, improve= 8.092071, (0 missing)
## CreditHistory.ThisBank.AllPaid < 0.5 to the right, improve= 7.040837, (0 missing)
## Surrogate splits:
## Amount < 355.5 to the left, agree=0.738, adj=0.005, (0 split)
##
## Node number 2: 211 observations, complexity param=0.03056769
## predicted class=1 expected loss=0.478673 P(node) =0.26375
## class counts: 101 110
## probabilities: 0.479 0.521
## left son=4 (178 obs) right son=5 (33 obs)
## Primary splits:
## Duration < 11.5 to the right, improve=8.373770, (0 missing)
## Amount < 4802.5 to the right, improve=4.982836, (0 missing)
## CreditHistory.Delay < 0.5 to the right, improve=3.726962, (0 missing)
## Job.SkilledEmployee < 0.5 to the right, improve=3.414315, (0 missing)
## ForeignWorker < 0.5 to the right, improve=3.382024, (0 missing)
## Surrogate splits:
## Age < 66.5 to the left, agree=0.858, adj=0.091, (0 split)
## Amount < 617.5 to the right, agree=0.853, adj=0.061, (0 split)
##
## Node number 3: 589 observations, complexity param=0.03056769
## predicted class=1 expected loss=0.2173175 P(node) =0.73625
## class counts: 128 461
## probabilities: 0.217 0.783
## left son=6 (210 obs) right son=7 (379 obs)
## Primary splits:
## CheckingAccountStatus.0.to.200 < 0.5 to the right, improve=20.662260, (0 missing)
## Amount < 10918 to the right, improve=15.274340, (0 missing)
## Duration < 25.5 to the right, improve= 8.276487, (0 missing)
## OtherInstallmentPlans.Bank < 0.5 to the right, improve= 5.258972, (0 missing)
## Age < 25.5 to the left, improve= 4.661922, (0 missing)
## Surrogate splits:
## Duration < 43.5 to the right, agree=0.660, adj=0.048, (0 split)
## Amount < 11191 to the right, agree=0.660, adj=0.048, (0 split)
## CreditHistory.NoCredit.AllPaid < 0.5 to the right, agree=0.654, adj=0.029, (0 split)
## CreditHistory.ThisBank.AllPaid < 0.5 to the right, agree=0.652, adj=0.024, (0 split)
## SavingsAccountBonds.100.to.500 < 0.5 to the right, agree=0.650, adj=0.019, (0 split)
##
## Node number 4: 178 observations, complexity param=0.02401747
## predicted class=0 expected loss=0.4606742 P(node) =0.2225
## class counts: 96 82
## probabilities: 0.539 0.461
## left son=8 (38 obs) right son=9 (140 obs)
## Primary splits:
## Duration < 31.5 to the right, improve=3.769739, (0 missing)
## Job.SkilledEmployee < 0.5 to the right, improve=3.558204, (0 missing)
## CreditHistory.Delay < 0.5 to the right, improve=2.756581, (0 missing)
## Amount < 4802.5 to the right, improve=2.493525, (0 missing)
## Purpose.NewCar < 0.5 to the right, improve=2.196990, (0 missing)
## Surrogate splits:
## Amount < 6668.5 to the right, agree=0.843, adj=0.263, (0 split)
##
## Node number 5: 33 observations
## predicted class=1 expected loss=0.1515152 P(node) =0.04125
## class counts: 5 28
## probabilities: 0.152 0.848
##
## Node number 6: 210 observations, complexity param=0.03056769
## predicted class=1 expected loss=0.3952381 P(node) =0.2625
## class counts: 83 127
## probabilities: 0.395 0.605
## left son=12 (16 obs) right son=13 (194 obs)
## Primary splits:
## Amount < 9908.5 to the right, improve=10.185580, (0 missing)
## Duration < 22.5 to the right, improve= 6.836080, (0 missing)
## Property.RealEstate < 0.5 to the left, improve= 6.773416, (0 missing)
## Housing.Own < 0.5 to the left, improve= 4.050114, (0 missing)
## Age < 25.5 to the left, improve= 2.835462, (0 missing)
##
## Node number 7: 379 observations
## predicted class=1 expected loss=0.1187335 P(node) =0.47375
## class counts: 45 334
## probabilities: 0.119 0.881
##
## Node number 8: 38 observations
## predicted class=0 expected loss=0.2631579 P(node) =0.0475
## class counts: 28 10
## probabilities: 0.737 0.263
##
## Node number 9: 140 observations, complexity param=0.02401747
## predicted class=1 expected loss=0.4857143 P(node) =0.175
## class counts: 68 72
## probabilities: 0.486 0.514
## left son=18 (129 obs) right son=19 (11 obs)
## Primary splits:
## Purpose.UsedCar < 0.5 to the left, improve=5.632780, (0 missing)
## Amount < 1377 to the left, improve=3.929252, (0 missing)
## Purpose.NewCar < 0.5 to the right, improve=3.629554, (0 missing)
## Purpose.Business < 0.5 to the left, improve=2.208009, (0 missing)
## InstallmentRatePercentage < 2.5 to the right, improve=1.545196, (0 missing)
## Surrogate splits:
## Age < 61.5 to the left, agree=0.929, adj=0.091, (0 split)
##
## Node number 12: 16 observations
## predicted class=0 expected loss=0.0625 P(node) =0.02
## class counts: 15 1
## probabilities: 0.938 0.062
##
## Node number 13: 194 observations, complexity param=0.01965066
## predicted class=1 expected loss=0.3505155 P(node) =0.2425
## class counts: 68 126
## probabilities: 0.351 0.649
## left son=26 (136 obs) right son=27 (58 obs)
## Primary splits:
## Property.RealEstate < 0.5 to the left, improve=4.281722, (0 missing)
## Duration < 22.5 to the right, improve=3.588005, (0 missing)
## Age < 25.5 to the left, improve=3.343549, (0 missing)
## CreditHistory.ThisBank.AllPaid < 0.5 to the right, improve=2.575549, (0 missing)
## OtherDebtorsGuarantors.None < 0.5 to the right, improve=2.533807, (0 missing)
## Surrogate splits:
## OtherDebtorsGuarantors.None < 0.5 to the right, agree=0.768, adj=0.224, (0 split)
## Amount < 632 to the right, agree=0.716, adj=0.052, (0 split)
## Age < 20.5 to the right, agree=0.706, adj=0.017, (0 split)
## OtherDebtorsGuarantors.CoApplicant < 0.5 to the left, agree=0.706, adj=0.017, (0 split)
## Job.UnskilledResident < 0.5 to the left, agree=0.706, adj=0.017, (0 split)
##
## Node number 18: 129 observations, complexity param=0.02401747
## predicted class=0 expected loss=0.4728682 P(node) =0.16125
## class counts: 68 61
## probabilities: 0.527 0.473
## left son=36 (121 obs) right son=37 (8 obs)
## Primary splits:
## Purpose.Business < 0.5 to the left, improve=2.758425, (0 missing)
## Amount < 1377 to the left, improve=2.425020, (0 missing)
## Purpose.NewCar < 0.5 to the right, improve=2.289513, (0 missing)
## InstallmentRatePercentage < 2.5 to the right, improve=1.865086, (0 missing)
## Age < 30.5 to the right, improve=1.542534, (0 missing)
##
## Node number 19: 11 observations
## predicted class=1 expected loss=0 P(node) =0.01375
## class counts: 0 11
## probabilities: 0.000 1.000
##
## Node number 26: 136 observations, complexity param=0.01965066
## predicted class=1 expected loss=0.4191176 P(node) =0.17
## class counts: 57 79
## probabilities: 0.419 0.581
## left son=52 (31 obs) right son=53 (105 obs)
## Primary splits:
## Age < 25.5 to the left, improve=4.103230, (0 missing)
## Personal.Male.Single < 0.5 to the left, improve=3.308824, (0 missing)
## Purpose.NewCar < 0.5 to the right, improve=3.045537, (0 missing)
## Housing.Rent < 0.5 to the right, improve=2.499899, (0 missing)
## Amount < 931.5 to the left, improve=2.272952, (0 missing)
## Surrogate splits:
## OtherDebtorsGuarantors.None < 0.5 to the left, agree=0.794, adj=0.097, (0 split)
## Duration < 54 to the right, agree=0.779, adj=0.032, (0 split)
## Amount < 546.5 to the left, agree=0.779, adj=0.032, (0 split)
##
## Node number 27: 58 observations, complexity param=0.01310044
## predicted class=1 expected loss=0.1896552 P(node) =0.0725
## class counts: 11 47
## probabilities: 0.190 0.810
## left son=54 (7 obs) right son=55 (51 obs)
## Primary splits:
## Duration < 22 to the right, improve=4.3822080, (0 missing)
## Age < 31.5 to the left, improve=2.1545090, (0 missing)
## OtherDebtorsGuarantors.None < 0.5 to the right, improve=1.5894910, (0 missing)
## Amount < 1221.5 to the right, improve=1.1907440, (0 missing)
## Purpose.Furniture.Equipment < 0.5 to the right, improve=0.9088187, (0 missing)
## Surrogate splits:
## OtherInstallmentPlans.Stores < 0.5 to the right, agree=0.914, adj=0.286, (0 split)
## Personal.Male.Divorced.Seperated < 0.5 to the right, agree=0.897, adj=0.143, (0 split)
##
## Node number 36: 121 observations, complexity param=0.02401747
## predicted class=0 expected loss=0.446281 P(node) =0.15125
## class counts: 67 54
## probabilities: 0.554 0.446
## left son=72 (82 obs) right son=73 (39 obs)
## Primary splits:
## InstallmentRatePercentage < 2.5 to the right, improve=2.368882, (0 missing)
## Purpose.Furniture.Equipment < 0.5 to the left, improve=2.368882, (0 missing)
## Amount < 1577.5 to the left, improve=2.144262, (0 missing)
## Purpose.NewCar < 0.5 to the right, improve=1.585437, (0 missing)
## OtherDebtorsGuarantors.None < 0.5 to the right, improve=1.149010, (0 missing)
## Surrogate splits:
## Amount < 3571 to the left, agree=0.744, adj=0.205, (0 split)
## Personal.Male.Divorced.Seperated < 0.5 to the left, agree=0.744, adj=0.205, (0 split)
## Duration < 29 to the left, agree=0.686, adj=0.026, (0 split)
## NumberExistingCredits < 2.5 to the left, agree=0.686, adj=0.026, (0 split)
## Purpose.Furniture.Equipment < 0.5 to the left, agree=0.686, adj=0.026, (0 split)
##
## Node number 37: 8 observations
## predicted class=1 expected loss=0.125 P(node) =0.01
## class counts: 1 7
## probabilities: 0.125 0.875
##
## Node number 52: 31 observations
## predicted class=0 expected loss=0.3548387 P(node) =0.03875
## class counts: 20 11
## probabilities: 0.645 0.355
##
## Node number 53: 105 observations, complexity param=0.01310044
## predicted class=1 expected loss=0.352381 P(node) =0.13125
## class counts: 37 68
## probabilities: 0.352 0.648
## left son=106 (52 obs) right son=107 (53 obs)
## Primary splits:
## SavingsAccountBonds.lt.100 < 0.5 to the right, improve=2.455014, (0 missing)
## Age < 48.5 to the right, improve=1.981837, (0 missing)
## Amount < 931.5 to the left, improve=1.964626, (0 missing)
## Housing.Own < 0.5 to the left, improve=1.689451, (0 missing)
## Personal.Male.Single < 0.5 to the left, improve=1.658566, (0 missing)
## Surrogate splits:
## SavingsAccountBonds.100.to.500 < 0.5 to the left, agree=0.724, adj=0.442, (0 split)
## Job.SkilledEmployee < 0.5 to the left, agree=0.610, adj=0.212, (0 split)
## Age < 31.5 to the right, agree=0.600, adj=0.192, (0 split)
## Telephone < 0.5 to the left, agree=0.600, adj=0.192, (0 split)
## Purpose.Furniture.Equipment < 0.5 to the right, agree=0.571, adj=0.135, (0 split)
##
## Node number 54: 7 observations
## predicted class=0 expected loss=0.2857143 P(node) =0.00875
## class counts: 5 2
## probabilities: 0.714 0.286
##
## Node number 55: 51 observations
## predicted class=1 expected loss=0.1176471 P(node) =0.06375
## class counts: 6 45
## probabilities: 0.118 0.882
##
## Node number 72: 82 observations, complexity param=0.01310044
## predicted class=0 expected loss=0.3780488 P(node) =0.1025
## class counts: 51 31
## probabilities: 0.622 0.378
## left son=144 (40 obs) right son=145 (42 obs)
## Primary splits:
## Amount < 1577.5 to the left, improve=1.658595, (0 missing)
## Telephone < 0.5 to the right, improve=1.449397, (0 missing)
## Purpose.NewCar < 0.5 to the right, improve=1.370800, (0 missing)
## Purpose.Radio.Television < 0.5 to the left, improve=1.132404, (0 missing)
## Age < 55 to the left, improve=1.081246, (0 missing)
## Surrogate splits:
## Purpose.Furniture.Equipment < 0.5 to the left, agree=0.646, adj=0.275, (0 split)
## Duration < 16.5 to the left, agree=0.634, adj=0.250, (0 split)
## InstallmentRatePercentage < 3.5 to the right, agree=0.622, adj=0.225, (0 split)
## Telephone < 0.5 to the right, agree=0.622, adj=0.225, (0 split)
## Personal.Male.Single < 0.5 to the left, agree=0.622, adj=0.225, (0 split)
##
## Node number 73: 39 observations, complexity param=0.01310044
## predicted class=1 expected loss=0.4102564 P(node) =0.04875
## class counts: 16 23
## probabilities: 0.410 0.590
## left son=146 (26 obs) right son=147 (13 obs)
## Primary splits:
## Duration < 15.5 to the right, improve=2.564103, (0 missing)
## Telephone < 0.5 to the left, improve=1.538462, (0 missing)
## EmploymentDuration.lt.1 < 0.5 to the right, improve=1.538462, (0 missing)
## Age < 30.5 to the right, improve=1.257664, (0 missing)
## Amount < 1961.5 to the right, improve=1.189036, (0 missing)
## Surrogate splits:
## Amount < 1828.5 to the right, agree=0.846, adj=0.538, (0 split)
## Age < 35 to the left, agree=0.692, adj=0.077, (0 split)
##
## Node number 106: 52 observations, complexity param=0.01310044
## predicted class=1 expected loss=0.4615385 P(node) =0.065
## class counts: 24 28
## probabilities: 0.462 0.538
## left son=212 (32 obs) right son=213 (20 obs)
## Primary splits:
## NumberExistingCredits < 1.5 to the left, improve=2.908654, (0 missing)
## Duration < 28.5 to the right, improve=2.447658, (0 missing)
## Age < 35.5 to the right, improve=1.846154, (0 missing)
## ResidenceDuration < 1.5 to the right, improve=1.246671, (0 missing)
## EmploymentDuration.gt.7 < 0.5 to the right, improve=1.231775, (0 missing)
## Surrogate splits:
## Age < 27.5 to the right, agree=0.673, adj=0.15, (0 split)
## InstallmentRatePercentage < 1.5 to the right, agree=0.654, adj=0.10, (0 split)
## CreditHistory.PaidDuly < 0.5 to the right, agree=0.654, adj=0.10, (0 split)
## OtherInstallmentPlans.Stores < 0.5 to the left, agree=0.654, adj=0.10, (0 split)
## Job.UnemployedUnskilled < 0.5 to the left, agree=0.654, adj=0.10, (0 split)
##
## Node number 107: 53 observations, complexity param=0.01091703
## predicted class=1 expected loss=0.245283 P(node) =0.06625
## class counts: 13 40
## probabilities: 0.245 0.755
## left son=214 (24 obs) right son=215 (29 obs)
## Primary splits:
## SavingsAccountBonds.100.to.500 < 0.5 to the right, improve=2.576664, (0 missing)
## Amount < 1930 to the left, improve=1.693587, (0 missing)
## EmploymentDuration.lt.1 < 0.5 to the right, improve=1.611103, (0 missing)
## EmploymentDuration.1.to.4 < 0.5 to the left, improve=1.334922, (0 missing)
## NumberExistingCredits < 1.5 to the right, improve=1.280380, (0 missing)
## Surrogate splits:
## Purpose.NewCar < 0.5 to the right, agree=0.679, adj=0.292, (0 split)
## EmploymentDuration.1.to.4 < 0.5 to the left, agree=0.660, adj=0.250, (0 split)
## EmploymentDuration.lt.1 < 0.5 to the right, agree=0.642, adj=0.208, (0 split)
## Age < 31.5 to the left, agree=0.604, adj=0.125, (0 split)
## InstallmentRatePercentage < 1.5 to the left, agree=0.585, adj=0.083, (0 split)
##
## Node number 144: 40 observations
## predicted class=0 expected loss=0.275 P(node) =0.05
## class counts: 29 11
## probabilities: 0.725 0.275
##
## Node number 145: 42 observations, complexity param=0.01310044
## predicted class=0 expected loss=0.4761905 P(node) =0.0525
## class counts: 22 20
## probabilities: 0.524 0.476
## left son=290 (30 obs) right son=291 (12 obs)
## Primary splits:
## Amount < 2135.5 to the right, improve=2.5190480, (0 missing)
## SavingsAccountBonds.lt.100 < 0.5 to the right, improve=2.0836940, (0 missing)
## Housing.Own < 0.5 to the left, improve=1.2857140, (0 missing)
## Age < 24.5 to the right, improve=0.9523810, (0 missing)
## NumberExistingCredits < 1.5 to the right, improve=0.6857143, (0 missing)
## Surrogate splits:
## ForeignWorker < 0.5 to the right, agree=0.738, adj=0.083, (0 split)
##
## Node number 146: 26 observations, complexity param=0.01310044
## predicted class=0 expected loss=0.4615385 P(node) =0.0325
## class counts: 14 12
## probabilities: 0.538 0.462
## left son=292 (12 obs) right son=293 (14 obs)
## Primary splits:
## Amount < 3506.5 to the left, improve=1.9945050, (0 missing)
## Duration < 19 to the left, improve=1.3594410, (0 missing)
## EmploymentDuration.lt.1 < 0.5 to the right, improve=1.0341880, (0 missing)
## Age < 30.5 to the right, improve=0.8480769, (0 missing)
## Property.CarOther < 0.5 to the left, improve=0.6175214, (0 missing)
## Surrogate splits:
## Age < 26.5 to the left, agree=0.731, adj=0.417, (0 split)
## Duration < 19 to the left, agree=0.654, adj=0.250, (0 split)
## Telephone < 0.5 to the right, agree=0.654, adj=0.250, (0 split)
## Purpose.Radio.Television < 0.5 to the right, agree=0.654, adj=0.250, (0 split)
## Housing.Rent < 0.5 to the right, agree=0.654, adj=0.250, (0 split)
##
## Node number 147: 13 observations
## predicted class=1 expected loss=0.1538462 P(node) =0.01625
## class counts: 2 11
## probabilities: 0.154 0.846
##
## Node number 212: 32 observations, complexity param=0.01310044
## predicted class=0 expected loss=0.40625 P(node) =0.04
## class counts: 19 13
## probabilities: 0.594 0.406
## left son=424 (22 obs) right son=425 (10 obs)
## Primary splits:
## Age < 32.5 to the right, improve=2.5102270, (0 missing)
## Duration < 11.5 to the right, improve=1.7003570, (0 missing)
## CreditHistory.PaidDuly < 0.5 to the left, improve=1.1002450, (0 missing)
## Amount < 1383 to the left, improve=0.8481280, (0 missing)
## EmploymentDuration.gt.7 < 0.5 to the right, improve=0.5976732, (0 missing)
## Surrogate splits:
## EmploymentDuration.lt.1 < 0.5 to the left, agree=0.812, adj=0.4, (0 split)
## Duration < 7.5 to the right, agree=0.719, adj=0.1, (0 split)
## CreditHistory.PaidDuly < 0.5 to the left, agree=0.719, adj=0.1, (0 split)
##
## Node number 213: 20 observations
## predicted class=1 expected loss=0.25 P(node) =0.025
## class counts: 5 15
## probabilities: 0.250 0.750
##
## Node number 214: 24 observations, complexity param=0.01091703
## predicted class=1 expected loss=0.4166667 P(node) =0.03
## class counts: 10 14
## probabilities: 0.417 0.583
## left son=428 (7 obs) right son=429 (17 obs)
## Primary splits:
## Job.SkilledEmployee < 0.5 to the left, improve=3.8347340, (0 missing)
## Amount < 2814.5 to the left, improve=1.9603730, (0 missing)
## Age < 31.5 to the right, improve=1.1523810, (0 missing)
## Property.CarOther < 0.5 to the left, improve=1.1523810, (0 missing)
## NumberExistingCredits < 1.5 to the right, improve=0.6736597, (0 missing)
## Surrogate splits:
## Job.UnskilledResident < 0.5 to the right, agree=0.875, adj=0.571, (0 split)
## Amount < 6499 to the right, agree=0.833, adj=0.429, (0 split)
## InstallmentRatePercentage < 1.5 to the left, agree=0.792, adj=0.286, (0 split)
## Property.Insurance < 0.5 to the right, agree=0.792, adj=0.286, (0 split)
## OtherInstallmentPlans.Stores < 0.5 to the right, agree=0.792, adj=0.286, (0 split)
##
## Node number 215: 29 observations
## predicted class=1 expected loss=0.1034483 P(node) =0.03625
## class counts: 3 26
## probabilities: 0.103 0.897
##
## Node number 290: 30 observations, complexity param=0.01310044
## predicted class=0 expected loss=0.3666667 P(node) =0.0375
## class counts: 19 11
## probabilities: 0.633 0.367
## left son=580 (22 obs) right son=581 (8 obs)
## Primary splits:
## SavingsAccountBonds.lt.100 < 0.5 to the right, improve=3.206061, (0 missing)
## Purpose.Radio.Television < 0.5 to the left, improve=1.456061, (0 missing)
## Age < 28.5 to the right, improve=1.354148, (0 missing)
## Duration < 15 to the left, improve=1.274242, (0 missing)
## NumberExistingCredits < 1.5 to the right, improve=1.274242, (0 missing)
## Surrogate splits:
## SavingsAccountBonds.500.to.1000 < 0.5 to the left, agree=0.833, adj=0.375, (0 split)
## SavingsAccountBonds.gt.1000 < 0.5 to the left, agree=0.833, adj=0.375, (0 split)
## Age < 22.5 to the right, agree=0.800, adj=0.250, (0 split)
## OtherInstallmentPlans.Stores < 0.5 to the left, agree=0.800, adj=0.250, (0 split)
## Duration < 29 to the left, agree=0.767, adj=0.125, (0 split)
##
## Node number 291: 12 observations
## predicted class=1 expected loss=0.25 P(node) =0.015
## class counts: 3 9
## probabilities: 0.250 0.750
##
## Node number 292: 12 observations
## predicted class=0 expected loss=0.25 P(node) =0.015
## class counts: 9 3
## probabilities: 0.750 0.250
##
## Node number 293: 14 observations
## predicted class=1 expected loss=0.3571429 P(node) =0.0175
## class counts: 5 9
## probabilities: 0.357 0.643
##
## Node number 424: 22 observations
## predicted class=0 expected loss=0.2727273 P(node) =0.0275
## class counts: 16 6
## probabilities: 0.727 0.273
##
## Node number 425: 10 observations
## predicted class=1 expected loss=0.3 P(node) =0.0125
## class counts: 3 7
## probabilities: 0.300 0.700
##
## Node number 428: 7 observations
## predicted class=0 expected loss=0.1428571 P(node) =0.00875
## class counts: 6 1
## probabilities: 0.857 0.143
##
## Node number 429: 17 observations
## predicted class=1 expected loss=0.2352941 P(node) =0.02125
## class counts: 4 13
## probabilities: 0.235 0.765
##
## Node number 580: 22 observations
## predicted class=0 expected loss=0.2272727 P(node) =0.0275
## class counts: 17 5
## probabilities: 0.773 0.227
##
## Node number 581: 8 observations
## predicted class=1 expected loss=0.25 P(node) =0.01
## class counts: 2 6
## probabilities: 0.250 0.750
printcp(tree.model)
##
## Classification tree:
## rpart(formula = Class ~ ., data = train.df, method = "class")
##
## Variables actually used in tree construction:
## [1] Age Amount
## [3] CheckingAccountStatus.0.to.200 CheckingAccountStatus.lt.0
## [5] Duration InstallmentRatePercentage
## [7] Job.SkilledEmployee NumberExistingCredits
## [9] Property.RealEstate Purpose.Business
## [11] Purpose.UsedCar SavingsAccountBonds.100.to.500
## [13] SavingsAccountBonds.lt.100
##
## Root node error: 229/800 = 0.28625
##
## n= 800
##
## CP nsplit rel error xerror xstd
## 1 0.030568 0 1.00000 1.0000 0.055828
## 2 0.024017 4 0.87773 1.0131 0.056045
## 3 0.019651 8 0.77293 1.0044 0.055901
## 4 0.013100 10 0.73362 1.0175 0.056116
## 5 0.010917 19 0.60699 1.0131 0.056045
## 6 0.010000 21 0.58515 1.0087 0.055973
Your Answer: From the complexity parameter table, the relative training error (rel error) falls from 1.000 with no splits to 0.585 with 21 splits, indicating a better fit on the training data. However, the cross-validated error (xerror) never drops below its root-node value of 1.000 and stays between 1.000 and 1.018 for every tree size, so the added splits do not improve cross-validated performance and the full tree is likely overfitting.
The variable importance results show that Amount (15), Duration (14), CheckingAccountStatus.lt.0 (14), and CheckingAccountStatus.0.to.200 (14) are the most influential predictors, with Age (7) a distant fifth.
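One common way to act on a table like this is to pick the subtree with the lowest cross-validated error, or the smallest subtree within one standard error of it. The sketch below re-enters the CP, nsplit, xerror and xstd columns printed above into a data frame named cp_tab (a name chosen here for illustration) and applies both rules. It illustrates the selection logic only and is not part of the original analysis.

```r
# CP table copied from the printcp() output above (values rounded as printed).
cp_tab <- data.frame(
  CP     = c(0.030568, 0.024017, 0.019651, 0.013100, 0.010917, 0.010000),
  nsplit = c(0, 4, 8, 10, 19, 21),
  xerror = c(1.0000, 1.0131, 1.0044, 1.0175, 1.0131, 1.0087),
  xstd   = c(0.055828, 0.056045, 0.055901, 0.056116, 0.056045, 0.055973)
)

# Rule 1: the row with the smallest cross-validated error.
best_row <- which.min(cp_tab$xerror)

# Rule 2 (one-standard-error rule): the smallest tree whose xerror is
# within one xstd of the minimum.
threshold  <- cp_tab$xerror[best_row] + cp_tab$xstd[best_row]
one_se_row <- min(which(cp_tab$xerror <= threshold))

cp_tab[c(best_row, one_se_row), c("CP", "nsplit", "xerror")]
```

With these numbers both rules select the row with nsplit = 0, i.e. the root-only tree. On the fitted object itself the same table is available as tree.model$cptable, and prune(tree.model, cp = ...) returns the subtree for a chosen CP value.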
rpart.plot(tree.model)
train.pred <- predict(tree.model, train.df, type = "class")
head(train.pred)
## 578 549 557 700 255 913
## 1 0 1 1 0 0
## Levels: 0 1
table(Predicted = train.pred, Actual = train.df$Class)
## Actual
## Predicted 0 1
## 0 145 50
## 1 84 521
mean(train.pred != train.df$Class)
## [1] 0.1675
Your Answer: The confusion matrix indicates that the model correctly classified 145 observations in class 0 and 521 in class 1. It misclassified 84 actual class 0 observations as class 1 and 50 actual class 1 observations as class 0. The misclassification rate (MR) is (84 + 50) / 800 = 0.1675, meaning 16.75% of the training observations were incorrectly classified.
test.pred <- predict(tree.model, test.df, type = "class")
table(Predicted = test.pred, Actual = test.df$Class)
## Actual
## Predicted 0 1
## 0 36 26
## 1 35 103
mean(test.pred != test.df$Class)
## [1] 0.305
Your Answer: The confusion matrix indicates that the model correctly classified 36 observations in class 0 and 103 in class 1. It misclassified 35 actual class 0 observations as class 1 and 26 actual class 1 observations as class 0. The misclassification rate (MR) is (35 + 26) / 200 = 0.305, so 30.5% of the testing observations were incorrectly classified. This is well above the 16.75% training rate, which suggests the tree overfits the training data.
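The misclassification rate hides how the errors are split between the two classes. As a rough illustration, the sketch below re-enters the two confusion matrices printed above and computes the misclassification rate together with the class-wise rates. The helper names make_cm and cm_metrics are hypothetical, and treating class 1 (Good) as the positive class is an assumption made here.

```r
# Confusion matrices re-entered from the two table() outputs above
# (rows = predicted class, columns = actual class).
make_cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))
}
train_cm <- make_cm(c(145, 84, 50, 521))
test_cm  <- make_cm(c(36, 35, 26, 103))

# Summarise one confusion matrix; class 1 ("Good") is treated as positive.
cm_metrics <- function(cm) {
  c(
    misclassification = 1 - sum(diag(cm)) / sum(cm),
    sensitivity       = cm["1", "1"] / sum(cm[, "1"]),  # actual 1 predicted as 1
    specificity       = cm["0", "0"] / sum(cm[, "0"])   # actual 0 predicted as 0
  )
}

round(rbind(train = cm_metrics(train_cm), test = cm_metrics(test_cm)), 4)
```

Both class-wise rates fall on the test set, and in each data set the tree identifies class 1 far more reliably than class 0: specificity is about 0.63 on the training data and about 0.51 on the test data.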
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
test.prob <- predict(tree.model, test.df, type = "prob")[,2]
roc.obj <- roc(test.df$Class, test.prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.obj)
auc(roc.obj)
## Area under the curve: 0.6743
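The AUC can be read as the probability that a randomly chosen class 1 case receives a higher predicted probability than a randomly chosen class 0 case, with ties counted as one half. As a rough illustration, the sketch below computes that quantity from ranks in base R. The vectors labels and scores are small made-up examples, not values from the German credit data, and auc_by_rank is a hypothetical helper name.

```r
# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks.
auc_by_rank <- function(labels, scores) {
  r  <- rank(scores)              # average ranks handle tied scores
  n1 <- sum(labels == 1)          # number of positive cases
  n0 <- sum(labels == 0)          # number of negative cases
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical toy data: 3 negatives and 3 positives.
labels <- c(0, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.70)

auc_by_rank(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

Applied to test.df$Class and test.prob, this should agree with auc(roc.obj) up to rounding, assuming class 1 is the positive class as in the roc() call above.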
We will use the built-in mtcars dataset to predict miles per gallon (mpg) using other car characteristics. The dataset includes information about 32 cars from Motor Trend magazine (1973-74).
# Load the mtcars dataset
data(mtcars)
# Display the structure of the dataset
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
set.seed(2020)
train_index <- sample(1:nrow(mtcars), 0.8 * nrow(mtcars))
train.df <- mtcars[train_index, ]
test.df <- mtcars[-train_index, ]
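Note that 0.8 * nrow(mtcars) is 25.6, which is not a whole number. A slightly more explicit variant of the split, sketched below under different object names (n_train, idx, train_part and test_part are illustrative only), rounds the training size down and checks that the two parts are disjoint and together cover all 32 rows. It does not claim to reproduce the rows drawn above.

```r
data(mtcars)
set.seed(2020)

# Round the training size down explicitly: floor(0.8 * 32) = 25.
n_train <- floor(0.8 * nrow(mtcars))
idx <- sample(seq_len(nrow(mtcars)), size = n_train)

train_part <- mtcars[idx, ]
test_part  <- mtcars[-idx, ]

# Sanity checks: sizes add up and no car appears in both parts.
c(train = nrow(train_part), test = nrow(test_part))
length(intersect(rownames(train_part), rownames(test_part)))
```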
tree.model <- rpart(mpg ~ ., data = train.df, method = "anova")
rpart.plot(tree.model)
Your Answer: The tree shows that the primary and only split is based on the variable cyl.
Cars with 5 or more cylinders (cyl ≥ 5) fall into one group with a lower predicted mpg value of approximately 17, while cars with fewer than 5 cylinders (cyl < 5) fall into another group with a higher predicted mpg value of approximately 27.
This indicates that the number of cylinders is the most important factor in determining fuel efficiency in this model, with cars having fewer cylinders generally achieving better gas mileage.
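In a regression tree fitted with method = "anova", the value shown in each leaf is the mean response of the training observations that fall into it. The sketch below illustrates this for a single cyl split using base R. It uses all 32 rows of mtcars rather than the training sample drawn above, so its group means are not expected to match the plotted values exactly; many_cyl and leaf_means are illustrative names.

```r
data(mtcars)

# Group cars the same way the single split does: cyl >= 5 versus cyl < 5.
many_cyl <- mtcars$cyl >= 5

# The prediction a one-split regression tree would make in each group
# is simply the group mean of mpg.
leaf_means <- tapply(mtcars$mpg, many_cyl, mean)
names(leaf_means) <- c("cyl < 5", "cyl >= 5")
leaf_means
```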
train.pred <- predict(tree.model, train.df)
mean((train.df$mpg - train.pred)^2)
## [1] 12.79054
1 - sum((train.df$mpg - train.pred)^2) / sum((train.df$mpg - mean(train.df$mpg))^2)
## [1] 0.6145199
Your Answer: The Mean Squared Error (MSE) is 12.79, indicating the average squared difference between the predicted and actual mpg values. The R-squared value is 0.615, meaning that approximately 61.5% of the variability in mpg is explained by the model on the training set.
test.pred <- predict(tree.model, test.df)
mean((test.df$mpg - test.pred)^2)
## [1] 12.31011
1 - sum((test.df$mpg - test.pred)^2) / sum((test.df$mpg - mean(test.df$mpg))^2)
## [1] 0.7085027
Your Answer: The Mean Squared Error (MSE) is 12.31, indicating the average squared difference between predicted and actual mpg values. The R-squared value is 0.709, meaning that approximately 70.9% of the variability in mpg is explained by the model on the testing set.
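The two metrics used above can be wrapped in small helpers so that the same formulas are applied to both data sets. The function names mse and r_squared are illustrative, and the example predictor (the mean mpg within each cyl group, computed on the full mtcars data) is a stand-in for a fitted tree, not the model from this report.

```r
# Mean squared error between observed and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# R-squared: 1 - SSE / SST, with SST taken around the mean of `actual`.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

data(mtcars)
# Stand-in predictions: each car is predicted by the mean mpg of its cyl group.
pred <- ave(mtcars$mpg, mtcars$cyl)

c(MSE = mse(mtcars$mpg, pred), R2 = r_squared(mtcars$mpg, pred))
```

With the objects from this report, mse(test.df$mpg, test.pred) and r_squared(test.df$mpg, test.pred) correspond to the two expressions evaluated above.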