---
title: "Homework6 Tree models"
author: "Tianhai Zu"
date: "10/22/2023"
output: html_document
---

Part 1, Classification Tree Method (50pts in total)

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for variable descriptions. The response variable is `Class`; all other variables are predictors.

Run the following code only once to install the caret package. The German credit scoring data is provided in that package.

install.packages('caret')

1. Load the caret package and the GermanCredit dataset. (5pts)

library(caret) # this package contains the German credit data in numeric format

## Loading required package: ggplot2

## Loading required package: lattice

data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 (Good) or 0 (Bad)
GermanCredit$Class <- as.factor(GermanCredit$Class) # make sure `Class` is a factor so that a classification tree is fitted; now 1 is good and 0 is bad
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...
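As a quick check of the `Class` recoding above, the following sketch applies the same two steps to a small made-up vector. `toy_class` is a hypothetical stand-in for `GermanCredit$Class`, not data from the package.

```r
# Toy stand-in for GermanCredit$Class (hypothetical values, for illustration only)
toy_class <- factor(c("Good", "Bad", "Good", "Good", "Bad"),
                    levels = c("Bad", "Good"))

# Same two steps as above: logical comparison -> 0/1 numeric -> factor
toy_numeric <- as.numeric(toy_class == "Good")
toy_factor <- as.factor(toy_numeric)

print(toy_numeric)         # 1 for "Good", 0 for "Bad"
print(levels(toy_factor))  # level "0" is bad, level "1" is good
print(table(toy_factor))   # counts per level
```

The factor levels come out as "0" and "1", which agrees with the `Factor w/ 2 levels "0","1"` line in the `str()` output.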

#load tree model packages
library(rpart)
library(rpart.plot)

# This code drops variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
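The source does not explain how the index vector above was chosen. One plausible reading, stated here as an assumption, is that it removes columns that never vary and the last dummy of each one-hot group, since that dummy is implied by the others. The sketch below shows how such columns could be found by name on a hypothetical toy data frame (`toy` and its values are made up for illustration).

```r
# Hypothetical toy data: one categorical variable expanded into three dummies,
# plus a column that is zero for every row
toy <- data.frame(
  Amount           = c(1200, 500, 3100, 800, 2500, 950),
  Housing.Rent     = c(1, 0, 0, 0, 1, 0),
  Housing.Own      = c(0, 1, 0, 1, 0, 1),
  Housing.ForFree  = c(0, 0, 1, 0, 0, 0),
  Purpose.Vacation = c(0, 0, 0, 0, 0, 0)
)

# Columns that take a single value cannot help any split
is_constant <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))
constant_cols <- names(toy)[is_constant]

# Within a dummy group every row sums to 1, so the last dummy is implied by the rest
housing_cols <- grep("^Housing\\.", names(toy), value = TRUE)
group_sums_to_one <- all(rowSums(toy[, housing_cols]) == 1)
redundant_cols <- if (group_sums_to_one) tail(housing_cols, 1) else character(0)

# Drop by name rather than by position
toy_reduced <- toy[, !(names(toy) %in% c(constant_cols, redundant_cols))]
print(constant_cols)
print(redundant_cols)
print(names(toy_reduced))
```

Dropping by name is less fragile than dropping by position, because a positional index silently removes the wrong columns if the column order ever changes.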

2. Split the dataset into training and test sets with an 80-20 split. Use the random seed `2024` for reproducibility. (5pts)

set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
GermanCredit_train = GermanCredit[index,]
GermanCredit_test = GermanCredit[-index,]
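A split like the one above can be sanity-checked before any model is fitted. The following sketch repeats the same indexing pattern on a hypothetical data frame with 1000 rows (`toy` and its `id` column are made up for illustration) and verifies the sizes and that no row ends up in both sets.

```r
# Hypothetical stand-in with the same number of rows as GermanCredit (1000)
toy <- data.frame(id = 1:1000, x = seq(0, 1, length.out = 1000))

set.seed(2024)
index <- sample(1:nrow(toy), nrow(toy) * 0.80)  # row numbers drawn without replacement
toy_train <- toy[index, ]
toy_test  <- toy[-index, ]

# The two pieces should have 800 and 200 rows, share no rows,
# and together cover every row of the original data
print(c(train = nrow(toy_train), test = nrow(toy_test)))
print(length(intersect(toy_train$id, toy_test$id)))
print(setequal(c(toy_train$id, toy_test$id), toy$id))
```

Because `sample()` draws without replacement by default, the negative index `-index` selects exactly the rows that were not drawn for training.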

3. Fit a classification tree model (without extra parameters) using the training set. Use all variables, and make sure the variable types (especially the response variable `Class`) are correct. (10pts)

library(rpart)
library(rpart.plot)
# fit the tree
GermanCredit_tree <- rpart(Class ~ ., data = GermanCredit_train)
summary(GermanCredit_tree)
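Several numbers in the summary printed for this tree can be reproduced by hand from the class counts it reports. The sketch below does this for the root split, assuming `rpart` is using its default Gini splitting criterion and default 0/1 loss (the call above sets neither explicitly). The counts and reported values are copied from the output; the helper names `expected_loss` and `gini` are introduced here for illustration.

```r
# Class counts copied from the summary output for nodes 1, 2 and 3
root  <- c(bad = 229, good = 571)  # node 1, 800 observations
left  <- c(bad = 101, good = 110)  # node 2, 211 observations
right <- c(bad = 128, good = 461)  # node 3, 589 observations

# Expected loss: share of the node that is not in the majority class
expected_loss <- function(counts) 1 - max(counts) / sum(counts)

# Gini impurity of a node
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Improvement of a split: n * impurity for the parent minus n * impurity for each child
improve <- sum(root) * gini(root) - sum(left) * gini(left) - sum(right) * gini(right)

print(expected_loss(root))  # reported in the summary as 0.28625
print(improve)              # reported in the summary as 21.222720

# rel error is measured relative to the root node's error.
# After 4 splits it is 0.8777293, so the average drop per split is:
cp_first <- (1 - 0.8777293) / 4
print(cp_first)             # reported in the CP table as 0.03056769
```

Note that in the CP table `xerror` never falls below its root value of 1.000000, while `rel error` keeps decreasing; the former is a cross-validated estimate and the latter is computed on the training data.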

## Call:
## rpart(formula = Class ~ ., data = GermanCredit_train)
##   n= 800 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.03056769      0 1.0000000 1.000000 0.05582842
## 2 0.02401747      4 0.8777293 1.013100 0.05604510
## 3 0.01965066      8 0.7729258 1.004367 0.05590117
## 4 0.01310044     10 0.7336245 1.017467 0.05611630
## 5 0.01091703     19 0.6069869 1.013100 0.05604510
## 6 0.01000000     21 0.5851528 1.008734 0.05597339
## 
## Variable importance
##                           Amount                         Duration 
##                               15                               14 
##       CheckingAccountStatus.lt.0   CheckingAccountStatus.0.to.200 
##                               14                               14 
##                              Age       SavingsAccountBonds.lt.100 
##                                7                                4 
##                  Purpose.UsedCar              Job.SkilledEmployee 
##                                4                                3 
##        InstallmentRatePercentage              Property.RealEstate 
##                                3                                3 
##   SavingsAccountBonds.100.to.500     OtherInstallmentPlans.Stores 
##                                3                                2 
##            NumberExistingCredits                 Purpose.Business 
##                                2                                2 
##            Job.UnskilledResident          EmploymentDuration.lt.1 
##                                1                                1 
##      OtherDebtorsGuarantors.None                        Telephone 
##                                1                                1 
##  SavingsAccountBonds.500.to.1000      SavingsAccountBonds.gt.1000 
##                                1                                1 
## Personal.Male.Divorced.Seperated               Property.Insurance 
##                                1                                1 
##      Purpose.Furniture.Equipment 
##                                1 
## 
## Node number 1: 800 observations,    complexity param=0.03056769
##   predicted class=1  expected loss=0.28625  P(node) =1
##     class counts:   229   571
##    probabilities: 0.286 0.714 
##   left son=2 (211 obs) right son=3 (589 obs)
##   Primary splits:
##       CheckingAccountStatus.lt.0     < 0.5    to the right, improve=21.222720, (0 missing)
##       Duration                       < 25.5   to the right, improve=13.584620, (0 missing)
##       Amount                         < 10918  to the right, improve=12.537530, (0 missing)
##       SavingsAccountBonds.lt.100     < 0.5    to the right, improve= 8.092071, (0 missing)
##       CreditHistory.ThisBank.AllPaid < 0.5    to the right, improve= 7.040837, (0 missing)
##   Surrogate splits:
##       Amount < 355.5  to the left,  agree=0.738, adj=0.005, (0 split)
## 
## Node number 2: 211 observations,    complexity param=0.03056769
##   predicted class=1  expected loss=0.478673  P(node) =0.26375
##     class counts:   101   110
##    probabilities: 0.479 0.521 
##   left son=4 (178 obs) right son=5 (33 obs)
##   Primary splits:
##       Duration            < 11.5   to the right, improve=8.373770, (0 missing)
##       Amount              < 4802.5 to the right, improve=4.982836, (0 missing)
##       CreditHistory.Delay < 0.5    to the right, improve=3.726962, (0 missing)
##       Job.SkilledEmployee < 0.5    to the right, improve=3.414315, (0 missing)
##       ForeignWorker       < 0.5    to the right, improve=3.382024, (0 missing)
##   Surrogate splits:
##       Age    < 66.5   to the left,  agree=0.858, adj=0.091, (0 split)
##       Amount < 617.5  to the right, agree=0.853, adj=0.061, (0 split)
## 
## Node number 3: 589 observations,    complexity param=0.03056769
##   predicted class=1  expected loss=0.2173175  P(node) =0.73625
##     class counts:   128   461
##    probabilities: 0.217 0.783 
##   left son=6 (210 obs) right son=7 (379 obs)
##   Primary splits:
##       CheckingAccountStatus.0.to.200 < 0.5    to the right, improve=20.662260, (0 missing)
##       Amount                         < 10918  to the right, improve=15.274340, (0 missing)
##       Duration                       < 25.5   to the right, improve= 8.276487, (0 missing)
##       OtherInstallmentPlans.Bank     < 0.5    to the right, improve= 5.258972, (0 missing)
##       Age                            < 25.5   to the left,  improve= 4.661922, (0 missing)
##   Surrogate splits:
##       Duration                       < 43.5   to the right, agree=0.660, adj=0.048, (0 split)
##       Amount                         < 11191  to the right, agree=0.660, adj=0.048, (0 split)
##       CreditHistory.NoCredit.AllPaid < 0.5    to the right, agree=0.654, adj=0.029, (0 split)
##       CreditHistory.ThisBank.AllPaid < 0.5    to the right, agree=0.652, adj=0.024, (0 split)
##       SavingsAccountBonds.100.to.500 < 0.5    to the right, agree=0.650, adj=0.019, (0 split)
## 
## Node number 4: 178 observations,    complexity param=0.02401747
##   predicted class=0  expected loss=0.4606742  P(node) =0.2225
##     class counts:    96    82
##    probabilities: 0.539 0.461 
##   left son=8 (38 obs) right son=9 (140 obs)
##   Primary splits:
##       Duration            < 31.5   to the right, improve=3.769739, (0 missing)
##       Job.SkilledEmployee < 0.5    to the right, improve=3.558204, (0 missing)
##       CreditHistory.Delay < 0.5    to the right, improve=2.756581, (0 missing)
##       Amount              < 4802.5 to the right, improve=2.493525, (0 missing)
##       Purpose.NewCar      < 0.5    to the right, improve=2.196990, (0 missing)
##   Surrogate splits:
##       Amount < 6668.5 to the right, agree=0.843, adj=0.263, (0 split)
## 
## Node number 5: 33 observations
##   predicted class=1  expected loss=0.1515152  P(node) =0.04125
##     class counts:     5    28
##    probabilities: 0.152 0.848 
## 
## Node number 6: 210 observations,    complexity param=0.03056769
##   predicted class=1  expected loss=0.3952381  P(node) =0.2625
##     class counts:    83   127
##    probabilities: 0.395 0.605 
##   left son=12 (16 obs) right son=13 (194 obs)
##   Primary splits:
##       Amount              < 9908.5 to the right, improve=10.185580, (0 missing)
##       Duration            < 22.5   to the right, improve= 6.836080, (0 missing)
##       Property.RealEstate < 0.5    to the left,  improve= 6.773416, (0 missing)
##       Housing.Own         < 0.5    to the left,  improve= 4.050114, (0 missing)
##       Age                 < 25.5   to the left,  improve= 2.835462, (0 missing)
## 
## Node number 7: 379 observations
##   predicted class=1  expected loss=0.1187335  P(node) =0.47375
##     class counts:    45   334
##    probabilities: 0.119 0.881 
## 
## Node number 8: 38 observations
##   predicted class=0  expected loss=0.2631579  P(node) =0.0475
##     class counts:    28    10
##    probabilities: 0.737 0.263 
## 
## Node number 9: 140 observations,    complexity param=0.02401747
##   predicted class=1  expected loss=0.4857143  P(node) =0.175
##     class counts:    68    72
##    probabilities: 0.486 0.514 
##   left son=18 (129 obs) right son=19 (11 obs)
##   Primary splits:
##       Purpose.UsedCar           < 0.5    to the left,  improve=5.632780, (0 missing)
##       Amount                    < 1377   to the left,  improve=3.929252, (0 missing)
##       Purpose.NewCar            < 0.5    to the right, improve=3.629554, (0 missing)
##       Purpose.Business          < 0.5    to the left,  improve=2.208009, (0 missing)
##       InstallmentRatePercentage < 2.5    to the right, improve=1.545196, (0 missing)
##   Surrogate splits:
##       Age < 61.5   to the left,  agree=0.929, adj=0.091, (0 split)
## 
## Node number 12: 16 observations
##   predicted class=0  expected loss=0.0625  P(node) =0.02
##     class counts:    15     1
##    probabilities: 0.938 0.062 
## 
## Node number 13: 194 observations,    complexity param=0.01965066
##   predicted class=1  expected loss=0.3505155  P(node) =0.2425
##     class counts:    68   126
##    probabilities: 0.351 0.649 
##   left son=26 (136 obs) right son=27 (58 obs)
##   Primary splits:
##       Property.RealEstate            < 0.5    to the left,  improve=4.281722, (0 missing)
##       Duration                       < 22.5   to the right, improve=3.588005, (0 missing)
##       Age                            < 25.5   to the left,  improve=3.343549, (0 missing)
##       CreditHistory.ThisBank.AllPaid < 0.5    to the right, improve=2.575549, (0 missing)
##       OtherDebtorsGuarantors.None    < 0.5    to the right, improve=2.533807, (0 missing)
##   Surrogate splits:
##       OtherDebtorsGuarantors.None        < 0.5    to the right, agree=0.768, adj=0.224, (0 split)
##       Amount                             < 632    to the right, agree=0.716, adj=0.052, (0 split)
##       Age                                < 20.5   to the right, agree=0.706, adj=0.017, (0 split)
##       OtherDebtorsGuarantors.CoApplicant < 0.5    to the left,  agree=0.706, adj=0.017, (0 split)
##       Job.UnskilledResident              < 0.5    to the left,  agree=0.706, adj=0.017, (0 split)
## 
## Node number 18: 129 observations,    complexity param=0.02401747
##   predicted class=0  expected loss=0.4728682  P(node) =0.16125
##     class counts:    68    61
##    probabilities: 0.527 0.473 
##   left son=36 (121 obs) right son=37 (8 obs)
##   Primary splits:
##       Purpose.Business          < 0.5    to the left,  improve=2.758425, (0 missing)
##       Amount                    < 1377   to the left,  improve=2.425020, (0 missing)
##       Purpose.NewCar            < 0.5    to the right, improve=2.289513, (0 missing)
##       InstallmentRatePercentage < 2.5    to the right, improve=1.865086, (0 missing)
##       Age                       < 30.5   to the right, improve=1.542534, (0 missing)
## 
## Node number 19: 11 observations
##   predicted class=1  expected loss=0  P(node) =0.01375
##     class counts:     0    11
##    probabilities: 0.000 1.000 
## 
## Node number 26: 136 observations,    complexity param=0.01965066
##   predicted class=1  expected loss=0.4191176  P(node) =0.17
##     class counts:    57    79
##    probabilities: 0.419 0.581 
##   left son=52 (31 obs) right son=53 (105 obs)
##   Primary splits:
##       Age                  < 25.5   to the left,  improve=4.103230, (0 missing)
##       Personal.Male.Single < 0.5    to the left,  improve=3.308824, (0 missing)
##       Purpose.NewCar       < 0.5    to the right, improve=3.045537, (0 missing)
##       Housing.Rent         < 0.5    to the right, improve=2.499899, (0 missing)
##       Amount               < 931.5  to the left,  improve=2.272952, (0 missing)
##   Surrogate splits:
##       OtherDebtorsGuarantors.None < 0.5    to the left,  agree=0.794, adj=0.097, (0 split)
##       Duration                    < 54     to the right, agree=0.779, adj=0.032, (0 split)
##       Amount                      < 546.5  to the left,  agree=0.779, adj=0.032, (0 split)
## 
## Node number 27: 58 observations,    complexity param=0.01310044
##   predicted class=1  expected loss=0.1896552  P(node) =0.0725
##     class counts:    11    47
##    probabilities: 0.190 0.810 
##   left son=54 (7 obs) right son=55 (51 obs)
##   Primary splits:
##       Duration                    < 22     to the right, improve=4.3822080, (0 missing)
##       Age                         < 31.5   to the left,  improve=2.1545090, (0 missing)
##       OtherDebtorsGuarantors.None < 0.5    to the right, improve=1.5894910, (0 missing)
##       Amount                      < 1221.5 to the right, improve=1.1907440, (0 missing)
##       Purpose.Furniture.Equipment < 0.5    to the right, improve=0.9088187, (0 missing)
##   Surrogate splits:
##       OtherInstallmentPlans.Stores     < 0.5    to the right, agree=0.914, adj=0.286, (0 split)
##       Personal.Male.Divorced.Seperated < 0.5    to the right, agree=0.897, adj=0.143, (0 split)
## 
## Node number 36: 121 observations,    complexity param=0.02401747
##   predicted class=0  expected loss=0.446281  P(node) =0.15125
##     class counts:    67    54
##    probabilities: 0.554 0.446 
##   left son=72 (82 obs) right son=73 (39 obs)
##   Primary splits:
##       InstallmentRatePercentage   < 2.5    to the right, improve=2.368882, (0 missing)
##       Purpose.Furniture.Equipment < 0.5    to the left,  improve=2.368882, (0 missing)
##       Amount                      < 1577.5 to the left,  improve=2.144262, (0 missing)
##       Purpose.NewCar              < 0.5    to the right, improve=1.585437, (0 missing)
##       OtherDebtorsGuarantors.None < 0.5    to the right, improve=1.149010, (0 missing)
##   Surrogate splits:
##       Amount                           < 3571   to the left,  agree=0.744, adj=0.205, (0 split)
##       Personal.Male.Divorced.Seperated < 0.5    to the left,  agree=0.744, adj=0.205, (0 split)
##       Duration                         < 29     to the left,  agree=0.686, adj=0.026, (0 split)
##       NumberExistingCredits            < 2.5    to the left,  agree=0.686, adj=0.026, (0 split)
##       Purpose.Furniture.Equipment      < 0.5    to the left,  agree=0.686, adj=0.026, (0 split)
## 
## Node number 37: 8 observations
##   predicted class=1  expected loss=0.125  P(node) =0.01
##     class counts:     1     7
##    probabilities: 0.125 0.875 
## 
## Node number 52: 31 observations
##   predicted class=0  expected loss=0.3548387  P(node) =0.03875
##     class counts:    20    11
##    probabilities: 0.645 0.355 
## 
## Node number 53: 105 observations,    complexity param=0.01310044
##   predicted class=1  expected loss=0.352381  P(node) =0.13125
##     class counts:    37    68
##    probabilities: 0.352 0.648 
##   left son=106 (52 obs) right son=107 (53 obs)
##   Primary splits:
##       SavingsAccountBonds.lt.100 < 0.5    to the right, improve=2.455014, (0 missing)
##       Age                        < 48.5   to the right, improve=1.981837, (0 missing)
##       Amount                     < 931.5  to the left,  improve=1.964626, (0 missing)
##       Housing.Own                < 0.5    to the left,  improve=1.689451, (0 missing)
##       Personal.Male.Single       < 0.5    to the left,  improve=1.658566, (0 missing)
##   Surrogate splits:
##       SavingsAccountBonds.100.to.500 < 0.5    to the left,  agree=0.724, adj=0.442, (0 split)
##       Job.SkilledEmployee            < 0.5    to the left,  agree=0.610, adj=0.212, (0 split)
##       Age                            < 31.5   to the right, agree=0.600, adj=0.192, (0 split)
##       Telephone                      < 0.5    to the left,  agree=0.600, adj=0.192, (0 split)
##       Purpose.Furniture.Equipment    < 0.5    to the right, agree=0.571, adj=0.135, (0 split)
## 
## Node number 54: 7 observations
##   predicted class=0  expected loss=0.2857143  P(node) =0.00875
##     class counts:     5     2
##    probabilities: 0.714 0.286 
## 
## Node number 55: 51 observations
##   predicted class=1  expected loss=0.1176471  P(node) =0.06375
##     class counts:     6    45
##    probabilities: 0.118 0.882 
## 
## Node number 72: 82 observations,    complexity param=0.01310044
##   predicted class=0  expected loss=0.3780488  P(node) =0.1025
##     class counts:    51    31
##    probabilities: 0.622 0.378 
##   left son=144 (40 obs) right son=145 (42 obs)
##   Primary splits:
##       Amount                   < 1577.5 to the left,  improve=1.658595, (0 missing)
##       Telephone                < 0.5    to the right, improve=1.449397, (0 missing)
##       Purpose.NewCar           < 0.5    to the right, improve=1.370800, (0 missing)
##       Purpose.Radio.Television < 0.5    to the left,  improve=1.132404, (0 missing)
##       Age                      < 55     to the left,  improve=1.081246, (0 missing)
##   Surrogate splits:
##       Purpose.Furniture.Equipment < 0.5    to the left,  agree=0.646, adj=0.275, (0 split)
##       Duration                    < 16.5   to the left,  agree=0.634, adj=0.250, (0 split)
##       InstallmentRatePercentage   < 3.5    to the right, agree=0.622, adj=0.225, (0 split)
##       Telephone                   < 0.5    to the right, agree=0.622, adj=0.225, (0 split)
##       Personal.Male.Single        < 0.5    to the left,  agree=0.622, adj=0.225, (0 split)
## 
## Node number 73: 39 observations,    complexity param=0.01310044
##   predicted class=1  expected loss=0.4102564  P(node) =0.04875
##     class counts:    16    23
##    probabilities: 0.410 0.590 
##   left son=146 (26 obs) right son=147 (13 obs)
##   Primary splits:
##       Duration                < 15.5   to the right, improve=2.564103, (0 missing)
##       Telephone               < 0.5    to the left,  improve=1.538462, (0 missing)
##       EmploymentDuration.lt.1 < 0.5    to the right, improve=1.538462, (0 missing)
##       Age                     < 30.5   to the right, improve=1.257664, (0 missing)
##       Amount                  < 1961.5 to the right, improve=1.189036, (0 missing)
##   Surrogate splits:
##       Amount < 1828.5 to the right, agree=0.846, adj=0.538, (0 split)
##       Age    < 35     to the left,  agree=0.692, adj=0.077, (0 split)
## 
## Node number 106: 52 observations,    complexity param=0.01310044
##   predicted class=1  expected loss=0.4615385  P(node) =0.065
##     class counts:    24    28
##    probabilities: 0.462 0.538 
##   left son=212 (32 obs) right son=213 (20 obs)
##   Primary splits:
##       NumberExistingCredits   < 1.5    to the left,  improve=2.908654, (0 missing)
##       Duration                < 28.5   to the right, improve=2.447658, (0 missing)
##       Age                     < 35.5   to the right, improve=1.846154, (0 missing)
##       ResidenceDuration       < 1.5    to the right, improve=1.246671, (0 missing)
##       EmploymentDuration.gt.7 < 0.5    to the right, improve=1.231775, (0 missing)
##   Surrogate splits:
##       Age                          < 27.5   to the right, agree=0.673, adj=0.15, (0 split)
##       InstallmentRatePercentage    < 1.5    to the right, agree=0.654, adj=0.10, (0 split)
##       CreditHistory.PaidDuly       < 0.5    to the right, agree=0.654, adj=0.10, (0 split)
##       OtherInstallmentPlans.Stores < 0.5    to the left,  agree=0.654, adj=0.10, (0 split)
##       Job.UnemployedUnskilled      < 0.5    to the left,  agree=0.654, adj=0.10, (0 split)
## 
## Node number 107: 53 observations,    complexity param=0.01091703
##   predicted class=1  expected loss=0.245283  P(node) =0.06625
##     class counts:    13    40
##    probabilities: 0.245 0.755 
##   left son=214 (24 obs) right son=215 (29 obs)
##   Primary splits:
##       SavingsAccountBonds.100.to.500 < 0.5    to the right, improve=2.576664, (0 missing)
##       Amount                         < 1930   to the left,  improve=1.693587, (0 missing)
##       EmploymentDuration.lt.1        < 0.5    to the right, improve=1.611103, (0 missing)
##       EmploymentDuration.1.to.4      < 0.5    to the left,  improve=1.334922, (0 missing)
##       NumberExistingCredits          < 1.5    to the right, improve=1.280380, (0 missing)
##   Surrogate splits:
##       Purpose.NewCar            < 0.5    to the right, agree=0.679, adj=0.292, (0 split)
##       EmploymentDuration.1.to.4 < 0.5    to the left,  agree=0.660, adj=0.250, (0 split)
##       EmploymentDuration.lt.1   < 0.5    to the right, agree=0.642, adj=0.208, (0 split)
##       Age                       < 31.5   to the left,  agree=0.604, adj=0.125, (0 split)
##       InstallmentRatePercentage < 1.5    to the left,  agree=0.585, adj=0.083, (0 split)
## 
## Node number 144: 40 observations
##   predicted class=0  expected loss=0.275  P(node) =0.05
##     class counts:    29    11
##    probabilities: 0.725 0.275 
## 
## Node number 145: 42 observations,    complexity param=0.01310044
##   predicted class=0  expected loss=0.4761905  P(node) =0.0525
##     class counts:    22    20
##    probabilities: 0.524 0.476 
##   left son=290 (30 obs) right son=291 (12 obs)
##   Primary splits:
##       Amount                     < 2135.5 to the right, improve=2.5190480, (0 missing)
##       SavingsAccountBonds.lt.100 < 0.5    to the right, improve=2.0836940, (0 missing)
##       Housing.Own                < 0.5    to the left,  improve=1.2857140, (0 missing)
##       Age                        < 24.5   to the right, improve=0.9523810, (0 missing)
##       NumberExistingCredits      < 1.5    to the right, improve=0.6857143, (0 missing)
##   Surrogate splits:
##       ForeignWorker < 0.5    to the right, agree=0.738, adj=0.083, (0 split)
## 
## Node number 146: 26 observations,    complexity param=0.01310044
##   predicted class=0  expected loss=0.4615385  P(node) =0.0325
##     class counts:    14    12
##    probabilities: 0.538 0.462 
##   left son=292 (12 obs) right son=293 (14 obs)
##   Primary splits:
##       Amount                  < 3506.5 to the left,  improve=1.9945050, (0 missing)
##       Duration                < 19     to the left,  improve=1.3594410, (0 missing)
##       EmploymentDuration.lt.1 < 0.5    to the right, improve=1.0341880, (0 missing)
##       Age                     < 30.5   to the right, improve=0.8480769, (0 missing)
##       Property.CarOther       < 0.5    to the left,  improve=0.6175214, (0 missing)
##   Surrogate splits:
##       Age                      < 26.5   to the left,  agree=0.731, adj=0.417, (0 split)
##       Duration                 < 19     to the left,  agree=0.654, adj=0.250, (0 split)
##       Telephone                < 0.5    to the right, agree=0.654, adj=0.250, (0 split)
##       Purpose.Radio.Television < 0.5    to the right, agree=0.654, adj=0.250, (0 split)
##       Housing.Rent             < 0.5    to the right, agree=0.654, adj=0.250, (0 split)
## 
## Node number 147: 13 observations
##   predicted class=1  expected loss=0.1538462  P(node) =0.01625
##     class counts:     2    11
##    probabilities: 0.154 0.846 
## 
## Node number 212: 32 observations,    complexity param=0.01310044
##   predicted class=0  expected loss=0.40625  P(node) =0.04
##     class counts:    19    13
##    probabilities: 0.594 0.406 
##   left son=424 (22 obs) right son=425 (10 obs)
##   Primary splits:
##       Age                     < 32.5   to the right, improve=2.5102270, (0 missing)
##       Duration                < 11.5   to the right, improve=1.7003570, (0 missing)
##       CreditHistory.PaidDuly  < 0.5    to the left,  improve=1.1002450, (0 missing)
##       Amount                  < 1383   to the left,  improve=0.8481280, (0 missing)
##       EmploymentDuration.gt.7 < 0.5    to the right, improve=0.5976732, (0 missing)
##   Surrogate splits:
##       EmploymentDuration.lt.1 < 0.5    to the left,  agree=0.812, adj=0.4, (0 split)
##       Duration                < 7.5    to the right, agree=0.719, adj=0.1, (0 split)
##       CreditHistory.PaidDuly  < 0.5    to the left,  agree=0.719, adj=0.1, (0 split)
## 
## Node number 213: 20 observations
##   predicted class=1  expected loss=0.25  P(node) =0.025
##     class counts:     5    15
##    probabilities: 0.250 0.750 
## 
## Node number 214: 24 observations,    complexity param=0.01091703
##   predicted class=1  expected loss=0.4166667  P(node) =0.03
##     class counts:    10    14
##    probabilities: 0.417 0.583 
##   left son=428 (7 obs) right son=429 (17 obs)
##   Primary splits:
##       Job.SkilledEmployee   < 0.5    to the left,  improve=3.8347340, (0 missing)
##       Amount                < 2814.5 to the left,  improve=1.9603730, (0 missing)
##       Age                   < 31.5   to the right, improve=1.1523810, (0 missing)
##       Property.CarOther     < 0.5    to the left,  improve=1.1523810, (0 missing)
##       NumberExistingCredits < 1.5    to the right, improve=0.6736597, (0 missing)
##   Surrogate splits:
##       Job.UnskilledResident        < 0.5    to the right, agree=0.875, adj=0.571, (0 split)
##       Amount                       < 6499   to the right, agree=0.833, adj=0.429, (0 split)
##       InstallmentRatePercentage    < 1.5    to the left,  agree=0.792, adj=0.286, (0 split)
##       Property.Insurance           < 0.5    to the right, agree=0.792, adj=0.286, (0 split)
##       OtherInstallmentPlans.Stores < 0.5    to the right, agree=0.792, adj=0.286, (0 split)
## 
## Node number 215: 29 observations
##   predicted class=1  expected loss=0.1034483  P(node) =0.03625
##     class counts:     3    26
##    probabilities: 0.103 0.897 
## 
## Node number 290: 30 observations,    complexity param=0.01310044
##   predicted class=0  expected loss=0.3666667  P(node) =0.0375
##     class counts:    19    11
##    probabilities: 0.633 0.367 
##   left son=580 (22 obs) right son=581 (8 obs)
##   Primary splits:
##       SavingsAccountBonds.lt.100 < 0.5    to the right, improve=3.206061, (0 missing)
##       Purpose.Radio.Television   < 0.5    to the left,  improve=1.456061, (0 missing)
##       Age                        < 28.5   to the right, improve=1.354148, (0 missing)
##       Duration                   < 15     to the left,  improve=1.274242, (0 missing)
##       NumberExistingCredits      < 1.5    to the right, improve=1.274242, (0 missing)
##   Surrogate splits:
##       SavingsAccountBonds.500.to.1000 < 0.5    to the left,  agree=0.833, adj=0.375, (0 split)
##       SavingsAccountBonds.gt.1000     < 0.5    to the left,  agree=0.833, adj=0.375, (0 split)
##       Age                             < 22.5   to the right, agree=0.800, adj=0.250, (0 split)
##       OtherInstallmentPlans.Stores    < 0.5    to the left,  agree=0.800, adj=0.250, (0 split)
##       Duration                        < 29     to the left,  agree=0.767, adj=0.125, (0 split)
## 
## Node number 291: 12 observations
##   predicted class=1  expected loss=0.25  P(node) =0.015
##     class counts:     3     9
##    probabilities: 0.250 0.750 
## 
## Node number 292: 12 observations
##   predicted class=0  expected loss=0.25  P(node) =0.015
##     class counts:     9     3
##    probabilities: 0.750 0.250 
## 
## Node number 293: 14 observations
##   predicted class=1  expected loss=0.3571429  P(node) =0.0175
##     class counts:     5     9
##    probabilities: 0.357 0.643 
## 
## Node number 424: 22 observations
##   predicted class=0  expected loss=0.2727273  P(node) =0.0275
##     class counts:    16     6
##    probabilities: 0.727 0.273 
## 
## Node number 425: 10 observations
##   predicted class=1  expected loss=0.3  P(node) =0.0125
##     class counts:     3     7
##    probabilities: 0.300 0.700 
## 
## Node number 428: 7 observations
##   predicted class=0  expected loss=0.1428571  P(node) =0.00875
##     class counts:     6     1
##    probabilities: 0.857 0.143 
## 
## Node number 429: 17 observations
##   predicted class=1  expected loss=0.2352941  P(node) =0.02125
##     class counts:     4    13
##    probabilities: 0.235 0.765 
## 
## Node number 580: 22 observations
##   predicted class=0  expected loss=0.2272727  P(node) =0.0275
##     class counts:    17     5
##    probabilities: 0.773 0.227 
## 
## Node number 581: 8 observations
##   predicted class=1  expected loss=0.25  P(node) =0.01
##     class counts:     2     6
##    probabilities: 0.250 0.750
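
As a reading aid for the node blocks above, the figures for node 290 can be reproduced by hand from its class counts. The sketch below assumes the default rpart settings (Gini splitting, priors taken from the data, 0/1 loss), under which "expected loss" is the minority-class share in the node and "improve" is n times the drop in Gini impurity. The helper names `node_stats` and `gini_n` are made up for illustration; the counts are copied from the output.

```r
# Class counts (class 0, class 1) copied from the summary output above.
n_train  <- 800                      # training observations, as implied by P(node)
node_290 <- c(`0` = 19, `1` = 11)    # parent node
node_580 <- c(`0` = 17, `1` = 5)     # left son
node_581 <- c(`0` = 2,  `1` = 6)     # right son

# Hypothetical helper: predicted class, expected loss and P(node) from raw counts.
node_stats <- function(counts, n_total) {
  list(
    predicted     = names(counts)[which.max(counts)],
    expected_loss = 1 - max(counts) / sum(counts),
    p_node        = sum(counts) / n_total
  )
}

# Hypothetical helper: node size times Gini impurity for a two-class node.
gini_n <- function(counts) {
  p <- counts / sum(counts)
  sum(counts) * (1 - sum(p^2))
}

stats_290 <- node_stats(node_290, n_train)
str(stats_290)

# Improvement of the split at node 290 = parent impurity minus children impurity.
improve_290 <- gini_n(node_290) - gini_n(node_580) - gini_n(node_581)
improve_290
```

The results line up with the printed values for node 290 (expected loss 0.3666667, P(node) 0.0375, improve 3.206061 for the `SavingsAccountBonds.lt.100` split).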

Your observation:

4. Visualize the tree. (5pts)

rpart.plot(GermanCredit_tree, extra = 1, yesno = 2)
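
In the call above, `extra = 1` prints the per-class observation counts in each node and `yesno = 2` writes yes/no at every split. If the rpart.plot package is not available, a rougher drawing can be made with rpart's own `plot()` and `text()` methods. The sketch below is only an illustration on stand-in data: it uses the `kyphosis` data set that ships with rpart, and `demo_tree` is a made-up name, not part of the homework.

```r
library(rpart)

# Stand-in data: kyphosis is bundled with rpart (not the German credit data).
demo_tree <- rpart(Kyphosis ~ Age + Number + Start,
                   data = kyphosis, method = "class")

# Base-graphics drawing of the fitted tree.
plot(demo_tree, uniform = TRUE, margin = 0.1)

# Add split labels; use.n = TRUE also prints the class counts at each leaf.
text(demo_tree, use.n = TRUE, cex = 0.8)
```

The same two calls should work on `GermanCredit_tree`, although a tree with this many splits will be crowded in base graphics.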

5. Use the training set to get predicted classes. (5pts)

# Predict on the training set; without `type`, predict() returns class probabilities
GermanCredit_pred.train <- predict(GermanCredit_tree, GermanCredit_train)
summary(GermanCredit_pred.train)

##        0                1         
##  Min.   :0.0000   Min.   :0.0625  
##  1st Qu.:0.1187   1st Qu.:0.6429  
##  Median :0.1187   Median :0.8813  
##  Mean   :0.2863   Mean   :0.7137  
##  3rd Qu.:0.3571   3rd Qu.:0.8813  
##  Max.   :0.9375   Max.   :1.0000
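
Note that the summary above describes a two-column matrix of class probabilities (columns `0` and `1`), not class labels, because `predict()` on a classification tree returns probabilities when `type` is not given. A minimal sketch of turning such probabilities into labels with a 0.5 cutoff follows. It uses the `kyphosis` data bundled with rpart as a stand-in, and the names `demo_tree`, `demo_prob` and `demo_class` are made up for illustration.

```r
library(rpart)

# Stand-in classification tree (not the German credit model).
demo_tree <- rpart(Kyphosis ~ Age + Number + Start,
                   data = kyphosis, method = "class")

# Default output for a classification tree: one probability column per class level.
demo_prob <- predict(demo_tree, kyphosis)
head(demo_prob, 3)

# Convert probabilities to hard labels with a 0.5 cutoff on the "present" column.
demo_class <- ifelse(demo_prob[, "present"] > 0.5, "present", "absent")
table(demo_class)

# With default priors, the average predicted probability on the training data
# equals the observed class share.
mean(demo_prob[, "present"])
mean(kyphosis$Kyphosis == "present")
```

A cutoff other than 0.5 may be preferable when the two kinds of error have different costs, which is often the case in credit scoring.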

6. Obtain the confusion matrix and misclassification rate (MR) on the training set. (5pts)
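
For reference, the misclassification rate is the share of observations whose predicted class differs from the actual class, i.e. the off-diagonal cells of the confusion matrix divided by the total count. The sketch below shows the computation in base R on a small set of made-up labels; `actual`, `predicted`, `conf_mat` and `mr` are hypothetical names and the values are not taken from the German credit data.

```r
# Hypothetical labels for illustration only (0 = bad, 1 = good).
actual    <- factor(c(1, 0, 1, 1, 0, 1, 0, 1), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 1, 1, 0, 1), levels = c(0, 1))

# Confusion matrix: rows are the true classes, columns are the predictions.
conf_mat <- table(Actual = actual, Predicted = predicted)
conf_mat

# MR = 1 - accuracy = off-diagonal count / total count.
mr <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
mr

# Equivalent shortcut without building the table.
mean(actual != predicted)
```

On the homework data, the same pattern would presumably take `GermanCredit_train$Class` as the actual labels and the class predictions from `predict(..., type = "class")` as the predicted ones.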

# Predicted classes on the training set (type = "class" returns labels instead of probabilities)
GermanCredit_pred.train <- predict(GermanCredit_tree, GermanCredit_train, type = "class")
GermanCredit_pred.train

##  578  549  557  700  255  913  621  416  105  634  738   29   11  784  925   62 
##    1    0    1    1    0    0    1    1    1    1    1    1    0    0    1    1 
##  252  398  930   26  172  562  410   32  725  385  203   35  361  238  593  284 
##    1    0    0    1    1    0    1    1    1    1    1    1    1    0    1    1 
##  304  216  596  476  852  427  442  884  276  951   87  505  997  618  892  900 
##    1    1    0    0    1    1    1    1    1    1    1    0    1    1    1    1 
##  647  948  441  336  212  835  281  290  217  825  817  310  858  643  153  705 
##    1    1    1    1    1    1    1    0    1    1    1    1    1    1    1    1 
##    6  788  393  719  717  464  963  354  186  305  627  108  261  720  902  131 
##    1    1    0    1    1    1    1    0    1    1    1    1    1    1    1    0 
##  938  459  723  414  329  189  259  541  954  747  960  445  334  528  548  209 
##    1    1    0    1    1    0    1    1    1    1    1    1    1    1    1    1 
##  585  935  752  118  891  402  875  674  147  652  834  873  987  173  702  454 
##    1    0    0    1    1    1    1    1    1    0    1    1    1    0    0    1 
##   68  543  795  113  463  827  932  736  483  635  943  504  888   94  446  765 
##    1    1    1    1    1    1    0    1    1    0    1    1    0    1    1    1 
##  982  270  715  457  661  706  266  896  346   34  625  187 1000  411  976  901 
##    1    1    0    1    1    1    0    1    1    1    0    0    1    0    1    0 
##  737  770  611  109  999  826  805  469  897  369  119  568  789  676  576  766 
##    0    1    0    1    0    1    1    1    0    0    0    1    0    1    1    1 
##   80   31  425  278  868  899  642  269  586  321   51  249  856  818  185  641 
##    0    1    0    1    1    1    1    1    0    1    1    1    1    1    0    0 
##  808  247  776   16  955  133  679  513  387  206   24  600  649  348  846  995 
##    1    1    0    0    1    1    0    1    1    1    1    1    1    0    1    1 
##   60  388  666  980  292  275  664  675  477  927  871  421   25  712  154  520 
##    0    1    1    0    1    1    1    1    1    0    1    1    1    1    1    1 
##  861  316  589  326   65  350  314  553  778  103  159  920  673  265  754  115 
##    1    0    0    1    1    1    0    0    0    1    1    0    1    1    1    1 
##   59  564  508  225  830  709  224  638  409  175  521  946  461   95  244  204 
##    1    0    0    1    1    1    1    1    1    0    1    1    0    1    1    0 
##  364  669  792  619  467  245  917  991  139  640  929  768  144  613  468  135 
##    1    0    1    0    1    1    1    1    1    0    1    1    0    1    1    1 
##  362  122  535  531  798  620   90  176  975  478  178  489  179  610  104  487 
##    1    1    1    1    1    1    0    1    1    1    1    1    1    1    0    1 
##  263  599  831  242  887  366   71  384  340  591  291  220  594  527  228  970 
##    1    1    1    1    1    1    1    1    1    0    1    1    0    1    0    1 
##  500  219  419  730  726  854  672  306  268  449  761   77  150  615  222  289 
##    1    0    1    1    1    0    1    1    1    1    1    0    1    1    0    1 
##  860  435  437  962  933  996  202   78  655   70  785  947  658   93  941  998 
##    1    1    1    1    1    1    0    0    1    1    1    1    1    1    1    1 
##  481  685  495  880  967   96  235  412  968  491  315  277  240   58  308  569 
##    1    0    1    1    0    0    1    1    1    1    1    1    0    1    0    1 
##  213  237  196  735   84  479  694  499  574  550  584  756  341  125  200  691 
##    1    0    1    1    1    1    1    1    1    1    0    0    0    1    0    1 
##  355  839  554  501   42  894  563  952  471  684  432  359  128  763  631   54 
##    1    1    1    0    1    1    1    0    0    1    0    1    1    0    1    1 
##  916  551  254  949  786  182   28  874   49  188  984  232  210  807  799  405 
##    0    1    1    1    1    0    1    1    1    1    0    1    1    1    1    1 
##   50  974  510  161  841   30  815  886  624  130  708  524  745  390  710  327 
##    1    0    1    1    0    0    0    0    1    0    0    1    0    1    1    1 
##  389  760  829  403  466  429  299  170  408  668  297  395  363  287   86  677 
##    1    0    0    1    1    1    1    0    0    1    1    1    1    0    1    1 
##  865  570  253  136  703  956  804  248  571  332  124  796  191   66  688  488 
##    1    0    0    1    1    0    1    1    0    1    1    1    1    1    1    1 
##  958  211  511  582  813  285  264  626  522  651    3  680  881  803  988  904 
##    1    1    0    1    0    1    1    1    0    0    1    1    1    1    1    1 
##  678  729  117   19  689  298  580  507  101  914  250  465    7  877  957   37 
##    0    0    0    0    1    1    1    1    1    1    1    1    1    1    1    1 
##  451  309  184  323  836   39  490  503  692  134  722    8   15  971   99  663 
##    1    1    1    1    0    1    1    1    1    1    0    0    1    1    0    1 
##  426  138  417  573  221  201  246  629  622   73  157  538   43  882  516   79 
##    0    1    0    1    1    1    1    1    1    1    1    0    1    1    1    1 
##  227  903  462  812  950  231   75  140  711  989  749  698    1  607  923  819 
##    0    1    1    1    1    1    0    1    1    1    1    1    1    1    1    0 
##  994  283  205  842  274  849  351  145  386  783  226  360  373  575   52   48 
##    0    1    1    1    0    1    1    1    1    1    1    0    1    1    1    1 
##  707  683  840  714  879  324  660  151  837  937  727  375  605   18  482   33 
##    0    1    1    1    1    1    0    1    1    1    1    0    1    1    1    1 
##  695  169   98  572  744  966  567  512  823  759  579  912  517  530  866  422 
##    1    1    1    1    1    1    1    1    0    1    1    0    1    1    1    1 
##  905  152  258  302  197  177  450  713  751  936  368  654  293  907  160  322 
##    1    1    0    0    1    1    1    1    1    0    1    0    1    1    1    1 
##  486  732  547  833  271  539  940  780  637  337   27  870   61  944  746  764 
##    0    0    1    0    1    0    1    0    1    0    1    0    1    1    1    1 
##  595  379   36  116  383   88  519  431  127  223  146  965  614  241  972   22 
##    0    0    0    1    1    0    1    1    0    1    1    0    1    0    1    1 
##   47  632  657  129  328   21   38  928  990  267  869  229   53  338  993   92 
##    1    0    0    1    1    1    1    0    1    1    1    1    1    0    0    1 
##  514  617  779  319   55  604  606  979  162  142  301  367  243  194  311  494 
##    1    1    1    1    1    1    1    1    1    1    1    1    0    1    1    1 
##  828  256  910  370  644  400  609  294  452  413  851  750  601  908  774  757 
##    1    0    1    1    1    1    1    1    1    1    0    1    1    1    1    1 
##  645  895  392  347  401  820  493  876  166  682  515  498  755  148  455  646 
##    1    1    1    1    1    0    1    1    1    1    1    1    1    1    1    1 
##  855  506  475  372    2   85  959  342  537  824  656  848  650  295  898    5 
##    1    1    0    1    0    1    0    1    1    1    1    1    0    1    1    0 
##  406  616  438  257  378  404  953  667  801  806  439  565  782  460  436   76 
##    0    0    1    1    1    1    1    1    1    0    0    1    1    1    1    1 
##  791  890  889   83  811  365  509  546  282  357  448  909  121  345    9  536 
##    0    1    1    1    1    0    1    0    1    1    1    1    1    1    1    1 
##   12  356  193  325  192  317  344  163  181  485  198  864   40  353  718  214 
##    0    0    0    1    0    1    1    1    1    1    1    1    1    1    1    1 
##  969  561  158  333  560  648  636  787  981  132  559  190  853  918  773  234 
##    1    1    0    0    1    1    1    1    0    0    0    1    1    1    1    1 
##  693  123  985  734  724  300  566  623   82   46  590  800  296  444   81  423 
##    1    1    1    1    1    1    0    1    1    1    0    1    0    1    1    1 
##  330  977  961   97   23    4  838  931  922  382   72  687  681   14  696  358 
##    1    1    1    1    1    0    1    1    1    0    1    1    1    0    1    1 
##  168  456  313  832  542  111  407  484  612  628  639  518  307  339  492  767 
##    1    1    1    0    1    1    1    1    1    1    1    1    1    0    1    0 
## Levels: 0 1

# Create confusion matrix
confusion_train <- table(true = GermanCredit_train$Class, pred = GermanCredit_pred.train)
confusion_train

##     pred
## true   0   1
##    0 145  84
##    1  50 521

# Calculate the Misclassification Rate (MR)
MR_train <- 1 - sum(diag(confusion_train)) / sum(confusion_train)
MR_train

## [1] 0.1675

Your observation:

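The training MR of 0.1675 means the tree misclassified 16.75% of the 800 training observations, for an accuracy of 83.25%. The errors are not symmetric: 84 of the 229 bad credits (class 0) were predicted as good, while only 50 of the 571 good credits (class 1) were predicted as bad. Because this rate is measured on the same data used to fit the tree, it is an optimistic estimate of out-of-sample performance.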
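The confusion matrix also gives per-class rates, which the overall MR hides. The sketch below re-enters the training confusion matrix from the printed counts instead of using the fitted tree, so it runs on its own; the names `cm`, `sensitivity`, and `specificity` are illustrative and not part of the original code.

```r
# Training confusion matrix, typed in from the printed output above
# (rows = true class, columns = predicted class)
cm <- matrix(c(145, 50, 84, 521), nrow = 2,
             dimnames = list(true = c("0", "1"), pred = c("0", "1")))

# Overall misclassification rate, same formula as MR_train
mr <- 1 - sum(diag(cm)) / sum(cm)

# Share of good credits (class 1) that were predicted good
sensitivity <- cm["1", "1"] / sum(cm["1", ])

# Share of bad credits (class 0) that were predicted bad
specificity <- cm["0", "0"] / sum(cm["0", ])

round(c(MR = mr, sensitivity = sensitivity, specificity = specificity), 4)
```

With these counts, the tree is considerably better at recognising good credits than bad ones.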

7. Use the testing set to get predicted classes. (5pts)

# Use the testing set to predict classes
GermanCredit_pred_test <- predict(GermanCredit_tree, GermanCredit_test, type = "class")
GermanCredit_pred_test

##  10  13  17  20  41  44  45  56  57  63  64  67  69  74  89  91 100 102 106 107 
##   1   0   1   1   1   1   0   1   1   0   0   1   1   0   1   1   1   0   0   1 
## 110 112 114 120 126 137 141 143 149 155 156 164 165 167 171 174 180 183 195 199 
##   1   1   1   1   1   1   1   0   0   0   1   0   1   0   0   1   0   1   0   1 
## 207 208 215 218 230 233 236 239 251 260 262 272 273 279 280 286 288 303 312 318 
##   1   1   1   1   0   1   1   1   1   1   1   1   0   1   1   0   0   1   1   1 
## 320 331 335 343 349 352 371 374 376 377 380 381 391 394 396 397 399 415 418 420 
##   1   1   0   1   1   1   1   1   0   1   1   1   1   1   0   0   1   0   1   1 
## 424 428 430 433 434 440 443 447 453 458 470 472 473 474 480 496 497 502 523 525 
##   1   1   0   1   1   1   1   0   1   1   1   1   1   1   0   0   0   0   0   1 
## 526 529 532 533 534 540 544 545 552 555 556 558 577 581 583 587 588 592 597 598 
##   1   0   1   1   1   1   1   1   1   1   0   1   1   1   1   1   0   1   0   0 
## 602 603 608 630 633 653 659 662 665 670 671 686 690 697 699 701 704 716 721 728 
##   1   0   1   1   0   0   1   0   1   1   1   1   1   1   1   1   1   1   1   1 
## 731 733 739 740 741 742 743 748 753 758 762 769 771 772 775 777 781 790 793 794 
##   0   1   1   0   0   1   1   0   1   1   1   1   1   0   1   1   0   0   1   1 
## 797 802 809 810 814 816 821 822 843 844 845 847 850 857 859 862 863 867 872 878 
##   1   1   0   1   0   0   1   1   1   1   1   1   0   1   0   1   0   0   1   1 
## 883 885 893 906 911 915 919 921 924 926 934 939 942 945 964 973 978 983 986 992 
##   0   0   0   1   1   1   0   1   1   0   1   0   1   1   1   0   1   1   0   1 
## Levels: 0 1

8. Obtain confusion matrix and MR on testing set. (5pts)

# Confusion matrix for the testing set
confusion_test <- table(true = GermanCredit_test$Class, pred = GermanCredit_pred_test)
confusion_test

##     pred
## true   0   1
##    0  36  35
##    1  26 103

# Calculate the MR for the testing set
MR_test <- 1 - sum(diag(confusion_test)) / sum(confusion_test)
MR_test

## [1] 0.305

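Your observation: The testing MR of 0.305 means that 30.5% of the 200 testing observations were misclassified (accuracy 69.5%). This is nearly double the training MR of 0.1675, which suggests the tree overfits the training data. For reference, always predicting class 1 would misclassify only the 71 bad credits, an MR of 0.355, so the tree improves only modestly on that baseline. Pruning the tree or trying alternative models such as random forests or SVM may improve performance.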
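One way to put the testing MR in context is to compare it with a trivial classifier that always predicts the majority class. A minimal sketch, again typing in the printed testing confusion matrix by hand (`cm_test`, `tree_mr`, and `baseline_mr` are illustrative names, not objects from the original code):

```r
# Testing confusion matrix, typed in from the printed output above
cm_test <- matrix(c(36, 26, 35, 103), nrow = 2,
                  dimnames = list(true = c("0", "1"), pred = c("0", "1")))

# MR of the fitted tree on the testing set
tree_mr <- 1 - sum(diag(cm_test)) / sum(cm_test)

# MR of a classifier that always predicts the larger class
# (class 1 is the majority in both the training and testing sets)
class_counts <- rowSums(cm_test)
baseline_mr <- 1 - max(class_counts) / sum(class_counts)

c(tree = tree_mr, baseline = baseline_mr)
```

A tree whose MR is close to the baseline adds little over simply approving every applicant.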

9. Obtain the ROC and AUC for testing data (not training). (5pts)

library(ROCR)

# Obtain predicted probabilities 
GermanCredit_pred_prob_test <- predict(GermanCredit_tree, GermanCredit_test, type = "prob")[, 2]

# Generate prediction 
pred_test <- prediction(GermanCredit_pred_prob_test, GermanCredit_test$Class)
roc_test <- performance(pred_test, "tpr", "fpr")

# ROC curve
plot(roc_test, colorize = TRUE, main = "ROC Curve Testing Set")

# Calculate and display AUC
auc_test <- performance(pred_test, "auc")
auc_test_value <- unlist(slot(auc_test, "y.values"))
auc_test_value

## [1] 0.6742548
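The AUC can be read as the probability that a randomly chosen good credit receives a higher predicted probability than a randomly chosen bad credit, with ties counted as one half. The sketch below checks that reading on a small made-up example in two ways: the rank-sum (Mann-Whitney) formula and a direct count over all pairs. The scores and labels are toy values, not output from `GermanCredit_tree`.

```r
# Toy predicted probabilities and true labels (1 = good, 0 = bad)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 1, 0, 0, 0)

n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)

# Rank-sum formula: rank() assigns average ranks to ties
r <- rank(scores)
auc_rank <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Direct pair count: fraction of (good, bad) pairs ordered correctly
pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

c(rank_formula = auc_rank, pair_count = auc_pairs)
```

Both calculations agree, so an AUC of about 0.67 means the tree ranks a good credit above a bad one in roughly two out of three such pairs.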

10. (Optional) Use cp or other parameters to prune the tree and see if you can get a better testing MR and testing AUC.
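One possible approach to this optional step is cost-complexity pruning: grow a deliberately large tree, inspect the cp table, and prune back to the cp value with the lowest cross-validated error. The sketch below uses the `kyphosis` data that ships with rpart as a stand-in so that it runs on its own. The data set, the `cp = 0.001` starting value, and the names `fit`, `best_cp`, and `pruned` are all assumptions for illustration; for the homework, the credit training set and `Class ~ .` would take their place.

```r
library(rpart)

set.seed(2024)  # the cross-validated errors in the cp table depend on the seed

# Grow a larger-than-default tree on stand-in data (kyphosis ships with rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = rpart.control(cp = 0.001))

# One row per candidate subtree, with its cross-validated error (xerror)
printcp(fit)

# Choose the cp with the smallest cross-validated error and prune to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)

# Number of leaves before and after pruning
c(before = sum(fit$frame$var == "<leaf>"),
  after  = sum(pruned$frame$var == "<leaf>"))
```

The pruned tree can then be scored on the testing set with the same confusion matrix, MR, and AUC code used above to see whether the smaller tree generalises better.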

Part 2, Regression Tree Method on mtcars data (50pts in total)

Starter code for mtcars dataset

We will use the built-in mtcars dataset to predict miles per gallon (mpg) using other car characteristics. The dataset includes information about 32 cars from Motor Trend magazine (1973-74).

0. Load the data (5pts)

# Load the mtcars dataset
data(mtcars)
# Display the structure of the dataset
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

1. Split the dataset into training and test set with 85-15 split. Use set.seed(2024) for reproducibility. (5pts)

set.seed(2024)

# Splitting
index <- sample(1:nrow(mtcars), size = floor(0.85 * nrow(mtcars)))
mtcars_train <- mtcars[index, ]  
mtcars_test <- mtcars[-index, ]  

# Display the dimensions
dim(mtcars_train)

## [1] 27 11

dim(mtcars_test)

## [1]  5 11

2. Fit a basic regression tree model using the training set with mpg as the response variable. Set method = “anova”. (10pts)

library(rpart)

# regression tree
mtcars_tree <- rpart(mpg ~ ., data = mtcars_train, method = "anova")
summary(mtcars_tree)

## Call:
## rpart(formula = mpg ~ ., data = mtcars_train, method = "anova")
##   n= 27 
## 
##          CP nsplit rel error    xerror      xstd
## 1 0.6121479      0 1.0000000 1.1010927 0.2728919
## 2 0.0100000      1 0.3878521 0.7898291 0.1866011
## 
## Variable importance
##  cyl disp   hp qsec   vs   wt 
##   20   18   18   14   14   14 
## 
## Node number 1: 27 observations,    complexity param=0.6121479
##   mean=20.50741, MSE=37.20809 
##   left son=2 (17 obs) right son=3 (10 obs)
##   Primary splits:
##       cyl  < 5      to the right, improve=0.6121479, (0 missing)
##       hp   < 118    to the right, improve=0.6068166, (0 missing)
##       wt   < 3.325  to the right, improve=0.5916267, (0 missing)
##       disp < 120.55 to the right, improve=0.5838435, (0 missing)
##       vs   < 0.5    to the left,  improve=0.5158466, (0 missing)
##   Surrogate splits:
##       disp < 142.9  to the right, agree=0.963, adj=0.9, (0 split)
##       hp   < 109.5  to the right, agree=0.963, adj=0.9, (0 split)
##       wt   < 2.5425 to the right, agree=0.889, adj=0.7, (0 split)
##       qsec < 18.41  to the left,  agree=0.889, adj=0.7, (0 split)
##       vs   < 0.5    to the left,  agree=0.889, adj=0.7, (0 split)
## 
## Node number 2: 17 observations
##   mean=16.84706, MSE=10.98484 
## 
## Node number 3: 10 observations
##   mean=26.73, MSE=20.2901

2. Visualize the tree using rpart.plot. Interpret the splits. (10pts)

library(rpart.plot)

# Plot the tree
rpart.plot(mtcars_tree, type = 3, digits = 2, fallen.leaves = TRUE, main = "Regression Tree for mpg")

mtcars_tree

## n= 27 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 27 1004.6190 20.50741  
##   2) cyl>=5 17  186.7424 16.84706 *
##   3) cyl< 5 10  202.9010 26.73000 *

Your observation:

Root Node Insights: The dataset contains 27 observations, with an average mpg of 20.51 and a total deviance of 1004.62.
First Split: The split is based on the number of cylinders cyl, dividing the cars into those with cyl >= 5 and cyl < 5.
Node for cyl >= 5: Includes 17 cars with an average mpg of 16.85 and a reduced deviance of 186.74. This is a terminal node.
Node for cyl < 5: Includes 10 cars with an average mpg of 26.73 and a reduced deviance of 202.90. This is also a terminal node.
Key Finding: cyl is the most important predictor, effectively splitting the data into two distinct groups based on mpg, with cars having fewer cylinders achieving higher fuel efficiency.
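The `improve=0.6121479` reported for the `cyl` split in the summary is the fraction of the root deviance that the split removes. As a quick check using only the deviances printed above (the variable names are illustrative):

```r
# Deviances (sums of squared deviations from the node mean) from the printed tree
root_dev  <- 1004.6190  # node 1, all 27 training cars
left_dev  <- 186.7424   # node 2, cyl >= 5
right_dev <- 202.9010   # node 3, cyl < 5

# Fraction of the root deviance removed by the split
improve <- 1 - (left_dev + right_dev) / root_dev
improve

# Remaining deviance averaged over the 27 training cars
mse_check <- (left_dev + right_dev) / 27
mse_check
```

Because the tree has a single split, the same fraction (about 0.612) is also the training R-squared, and the averaged remaining deviance (about 14.43) is the training MSE.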

3. Make predictions and calculate MSE and R-squared on training set. (10pts)

# predictions on the training set
mtcars_train_predictions <- predict(mtcars_tree, mtcars_train)

# Calculate MSE
MSE_train <- mean((mtcars_train$mpg - mtcars_train_predictions)^2)
MSE_train

## [1] 14.43124

# Calculate R-squared
SS_total <- sum((mtcars_train$mpg - mean(mtcars_train$mpg))^2)
SS_residual <- sum((mtcars_train$mpg - mtcars_train_predictions)^2)  
R_squared_train <- 1 - (SS_residual / SS_total)
R_squared_train

## [1] 0.6121479

Your observation:

MSE: 14.431, the average squared difference between the predicted and actual mpg values, which corresponds to a typical error of about 3.8 mpg. While not perfect, it suggests the tree provides a reasonable fit to the training data.

R-squared: 0.612, meaning approximately 61.2% of the variance in mpg is explained by the regression tree. This indicates a moderate level of explanatory power, with room for improvement.

4. Make predictions and calculate MSE and R-squared on testing set. (10pts)

# Make predictions
pred_test <- predict(mtcars_tree, newdata = mtcars_test)

# Calculate MSE 
mse_test <- mean((mtcars_test$mpg - pred_test)^2)

# Calculate R-squared for the test set
sst <- sum((mtcars_test$mpg - mean(mtcars_test$mpg))^2)  
ssr <- sum((mtcars_test$mpg - pred_test)^2)          
r2_test <- 1 - (ssr / sst)

# Results
mse_test

## [1] 2.619646

r2_test

## [1] 0.8567122

Your observation:

On the training set the tree had an MSE of 14.431 and an R-squared of 0.612, a moderate fit.

The MSE for the testing set is 2.62, much lower than on the training set, and the R-squared of 0.857 indicates that the tree explains approximately 85.7% of the variance in mpg for the testing data. However, the testing set contains only 5 cars, so both figures depend heavily on which cars happened to be sampled. A model that fits held-out data better than its own training data is more likely benefiting from a small, favorable test sample than demonstrating strong generalizability. Repeating the split with different seeds or using cross-validation would give a more reliable estimate.
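Because the testing set holds only 5 cars, a single split may not be representative. A rough way to see how much the testing MSE moves around is to repeat the 85-15 split under different seeds and refit the same default tree each time. This is a sketch under stated assumptions: the helper `split_test_mse` and the choice of 50 seeds are hypothetical and not part of the assignment.

```r
library(rpart)

data(mtcars)
n <- nrow(mtcars)
n_train <- floor(0.85 * n)  # 27 training cars, 5 testing cars

# Hypothetical helper: testing MSE of a default regression tree for one random split
split_test_mse <- function(seed) {
  set.seed(seed)
  idx <- sample(seq_len(n), size = n_train)
  fit <- rpart(mpg ~ ., data = mtcars[idx, ], method = "anova")
  pred <- predict(fit, newdata = mtcars[-idx, ])
  mean((mtcars$mpg[-idx] - pred)^2)
}

# Testing MSE across 50 different splits
test_mse <- sapply(1:50, split_test_mse)
summary(test_mse)
```

A wide spread in `summary(test_mse)` would indicate that the value from any one seed says little about how the tree performs on new cars.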

Part 3 (Optional): Please recall the results from previous homework, how do you compare them? Just discuss.
