Part 1, Classification Tree Method (50pts in total)

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for variable descriptions. The response variable is Class; all other variables are predictors.

Only run the installation code for the caret package once. The German credit scoring data is provided in that package.

1. Load the caret package and the GermanCredit dataset. (5pts)

library(caret) # this package provides the German credit data in numeric (dummy-coded) format
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## Loading required package: lattice
data(GermanCredit)
# Recode the Class variable as 1 and 0: "Good" becomes 1 and "Bad" becomes 0.
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") 
GermanCredit$Class <- as.factor(GermanCredit$Class) 
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...
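
The two-step recoding of Class used above can be checked on a small toy factor. This is only a sketch: `class_raw` and its values are hypothetical stand-ins for `GermanCredit$Class`, not the real data. It also illustrates a common pitfall when converting the resulting factor back to numbers.

```r
# Toy stand-in for GermanCredit$Class (hypothetical values, not the real data)
class_raw <- factor(c("Good", "Bad", "Good", "Good", "Bad"),
                    levels = c("Bad", "Good"))

# Same two steps as above: logical comparison -> 0/1 numeric -> factor
class_num <- as.numeric(class_raw == "Good")
class_fac <- as.factor(class_num)

print(class_num)          # 1 0 1 1 0
print(levels(class_fac))  # "0" "1"

# Pitfall: as.numeric() on a factor returns the internal level codes (1, 2),
# not the labels. Convert through character to recover the 0/1 values.
codes  <- as.numeric(class_fac)                # 2 1 2 2 1
labels <- as.numeric(as.character(class_fac))  # 1 0 1 1 0
```

This matches the str() output above, where Class is a factor with levels "0" and "1" stored internally as codes 1 and 2.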
#load tree model packages
library(rpart)

library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
# This code drops variables that provide no useful information in the data.
# Only run this code chunk ONCE: the column indices refer to the original 62-column data frame.
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
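
The index-based drop above must run only once because the positions shift after columns are removed. As a hedged alternative, the sketch below drops columns by name, which is safe to re-run. The data frame `toy`, the helper `drop_by_name`, and the chosen column names are hypothetical and serve only as an illustration; they are not the assignment's actual drop list.

```r
# Hypothetical toy data frame standing in for GermanCredit
toy <- data.frame(
  Amount        = c(1169, 5951, 2096),
  Housing.Rent  = c(0, 1, 0),
  Housing.Own   = c(1, 0, 1),
  Purpose.Other = c(0, 0, 0)
)

# Columns to remove, given by name rather than by position
drop_cols <- c("Housing.Own", "Purpose.Other")

# Name-based drop: columns that are already gone are simply not matched
drop_by_name <- function(df, cols) {
  df[, !(names(df) %in% cols), drop = FALSE]
}

once  <- drop_by_name(toy, drop_cols)
twice <- drop_by_name(once, drop_cols)

print(names(once))             # "Amount" "Housing.Rent"
print(identical(once, twice))  # TRUE: running it again changes nothing
```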

2. Split the dataset into training and test sets with an 80-20 split. Please use 2024 as the random seed for reproducibility. (5pts)

set.seed(2024)

train_index <- sample(1:nrow(GermanCredit), 0.8 * nrow(GermanCredit))

train.df <- GermanCredit[train_index, ]
test.df  <- GermanCredit[-train_index, ]
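
A quick sanity check of the split logic, sketched on row indices only so that it runs without the data. It assumes n = 1000 rows, as reported by str() above; `test_index` and `again` are names introduced here for illustration, and `floor()` is added as a precaution for row counts where 0.8 * n is not a whole number.

```r
n <- 1000  # number of rows in GermanCredit, per the str() output above

set.seed(2024)
train_index <- sample(seq_len(n), size = floor(0.8 * n))
test_index  <- setdiff(seq_len(n), train_index)

# The two sets should have 800 and 200 rows and share no row
print(length(train_index))
print(length(test_index))
print(length(intersect(train_index, test_index)))

# Resetting the seed reproduces exactly the same training rows
set.seed(2024)
again <- sample(seq_len(n), size = floor(0.8 * n))
print(identical(train_index, again))
```

Sampling without replacement (the default for `sample()`) guarantees that no row appears twice in the training set, and negative indexing with `-train_index` then leaves the remaining 20% for the test set.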

3. Fit a classification tree model (without extra parameters) using the training set. Please use all variables, but make sure the variable types (especially the response variable Class) are right. (10pts)

# Class is already a factor after step 1; these calls are a harmless safeguard
train.df$Class <- as.factor(train.df$Class)
test.df$Class  <- as.factor(test.df$Class)
tree.model <- rpart(Class ~ ., data = train.df, method = "class")
summary(tree.model)
## Call:
## rpart(formula = Class ~ ., data = train.df, method = "class")
##   n= 800 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.03056769      0 1.0000000 1.000000 0.05582842
## 2 0.02401747      4 0.8777293 1.013100 0.05604510
## 3 0.01965066      8 0.7729258 1.004367 0.05590117
## 4 0.01310044     10 0.7336245 1.017467 0.05611630
## 5 0.01091703     19 0.6069869 1.013100 0.05604510
## 6 0.01000000     21 0.5851528 1.008734 0.05597339
## 
## Variable importance
##                           Amount                         Duration 
##                               15                               14 
##       CheckingAccountStatus.lt.0   CheckingAccountStatus.0.to.200 
##                               14                               14 
##                              Age       SavingsAccountBonds.lt.100 
##                                7                                4 
##                  Purpose.UsedCar              Job.SkilledEmployee 
##                                4                                3 
##        InstallmentRatePercentage              Property.RealEstate 
##                                3                                3 
##   SavingsAccountBonds.100.to.500     OtherInstallmentPlans.Stores 
##                                3                                2 
##            NumberExistingCredits                 Purpose.Business 
##                                2                                2 
##            Job.UnskilledResident          EmploymentDuration.lt.1 
##                                1                                1 
##      OtherDebtorsGuarantors.None                        Telephone 
##                                1                                1 
##  SavingsAccountBonds.500.to.1000      SavingsAccountBonds.gt.1000 
##                                1                                1 
## Personal.Male.Divorced.Seperated               Property.Insurance 
##                                1                                1 
##      Purpose.Furniture.Equipment 
##                                1 
## 
## Node number 1: 800 observations,    complexity param=0.03056769
##   predicted class=1  expected loss=0.28625  P(node) =1
##     class counts:   229   571
##    probabilities: 0.286 0.714 
##   left son=2 (211 obs) right son=3 (589 obs)
##   Primary splits:
##       CheckingAccountStatus.lt.0     < 0.5    to the right, improve=21.222720, (0 missing)
##       Duration                       < 25.5   to the right, improve=13.584620, (0 missing)
##       Amount                         < 10918  to the right, improve=12.537530, (0 missing)
##       SavingsAccountBonds.lt.100     < 0.5    to the right, improve= 8.092071, (0 missing)
##       CreditHistory.ThisBank.AllPaid < 0.5    to the right, improve= 7.040837, (0 missing)
##   Surrogate splits:
##       Amount < 355.5  to the left,  agree=0.738, adj=0.005, (0 split)
## 
## Node number 2: 211 observations,    complexity param=0.03056769
##   predicted class=1  expected loss=0.478673  P(node) =0.26375
##     class counts:   101   110
##    probabilities: 0.479 0.521 
##   left son=4 (178 obs) right son=5 (33 obs)
##   Primary splits:
##       Duration            < 11.5   to the right, improve=8.373770, (0 missing)
##       Amount              < 4802.5 to the right, improve=4.982836, (0 missing)
##       CreditHistory.Delay < 0.5    to the right, improve=3.726962, (0 missing)
##       Job.SkilledEmployee < 0.5    to the right, improve=3.414315, (0 missing)
##       ForeignWorker       < 0.5    to the right, improve=3.382024, (0 missing)
##   Surrogate splits:
##       Age    < 66.5   to the left,  agree=0.858, adj=0.091, (0 split)
##       Amount < 617.5  to the right, agree=0.853, adj=0.061, (0 split)
## 
## Node number 3: 589 observations,    complexity param=0.03056769
##   predicted class=1  expected loss=0.2173175  P(node) =0.73625
##     class counts:   128   461
##    probabilities: 0.217 0.783 
##   left son=6 (210 obs) right son=7 (379 obs)
##   Primary splits:
##       CheckingAccountStatus.0.to.200 < 0.5    to the right, improve=20.662260, (0 missing)
##       Amount                         < 10918  to the right, improve=15.274340, (0 missing)
##       Duration                       < 25.5   to the right, improve= 8.276487, (0 missing)
##       OtherInstallmentPlans.Bank     < 0.5    to the right, improve= 5.258972, (0 missing)
##       Age                            < 25.5   to the left,  improve= 4.661922, (0 missing)
##   Surrogate splits:
##       Duration                       < 43.5   to the right, agree=0.660, adj=0.048, (0 split)
##       Amount                         < 11191  to the right, agree=0.660, adj=0.048, (0 split)
##       CreditHistory.NoCredit.AllPaid < 0.5    to the right, agree=0.654, adj=0.029, (0 split)
##       CreditHistory.ThisBank.AllPaid < 0.5    to the right, agree=0.652, adj=0.024, (0 split)
##       SavingsAccountBonds.100.to.500 < 0.5    to the right, agree=0.650, adj=0.019, (0 split)
## 
## Node number 4: 178 observations,    complexity param=0.02401747
##   predicted class=0  expected loss=0.4606742  P(node) =0.2225
##     class counts:    96    82
##    probabilities: 0.539 0.461 
##   left son=8 (38 obs) right son=9 (140 obs)
##   Primary splits:
##       Duration            < 31.5   to the right, improve=3.769739, (0 missing)
##       Job.SkilledEmployee < 0.5    to the right, improve=3.558204, (0 missing)
##       CreditHistory.Delay < 0.5    to the right, improve=2.756581, (0 missing)
##       Amount              < 4802.5 to the right, improve=2.493525, (0 missing)
##       Purpose.NewCar      < 0.5    to the right, improve=2.196990, (0 missing)
##   Surrogate splits:
##       Amount < 6668.5 to the right, agree=0.843, adj=0.263, (0 split)
## 
## Node number 5: 33 observations
##   predicted class=1  expected loss=0.1515152  P(node) =0.04125
##     class counts:     5    28
##    probabilities: 0.152 0.848 
## 
## Node number 6: 210 observations,    complexity param=0.03056769
##   predicted class=1  expected loss=0.3952381  P(node) =0.2625
##     class counts:    83   127
##    probabilities: 0.395 0.605 
##   left son=12 (16 obs) right son=13 (194 obs)
##   Primary splits:
##       Amount              < 9908.5 to the right, improve=10.185580, (0 missing)
##       Duration            < 22.5   to the right, improve= 6.836080, (0 missing)
##       Property.RealEstate < 0.5    to the left,  improve= 6.773416, (0 missing)
##       Housing.Own         < 0.5    to the left,  improve= 4.050114, (0 missing)
##       Age                 < 25.5   to the left,  improve= 2.835462, (0 missing)
## 
## Node number 7: 379 observations
##   predicted class=1  expected loss=0.1187335  P(node) =0.47375
##     class counts:    45   334
##    probabilities: 0.119 0.881 
## 
## Node number 8: 38 observations
##   predicted class=0  expected loss=0.2631579  P(node) =0.0475
##     class counts:    28    10
##    probabilities: 0.737 0.263 
## 
## Node number 9: 140 observations,    complexity param=0.02401747
##   predicted class=1  expected loss=0.4857143  P(node) =0.175
##     class counts:    68    72
##    probabilities: 0.486 0.514 
##   left son=18 (129 obs) right son=19 (11 obs)
##   Primary splits:
##       Purpose.UsedCar           < 0.5    to the left,  improve=5.632780, (0 missing)
##       Amount                    < 1377   to the left,  improve=3.929252, (0 missing)
##       Purpose.NewCar            < 0.5    to the right, improve=3.629554, (0 missing)
##       Purpose.Business          < 0.5    to the left,  improve=2.208009, (0 missing)
##       InstallmentRatePercentage < 2.5    to the right, improve=1.545196, (0 missing)
##   Surrogate splits:
##       Age < 61.5   to the left,  agree=0.929, adj=0.091, (0 split)
## 
## Node number 12: 16 observations
##   predicted class=0  expected loss=0.0625  P(node) =0.02
##     class counts:    15     1
##    probabilities: 0.938 0.062 
## 
## Node number 13: 194 observations,    complexity param=0.01965066
##   predicted class=1  expected loss=0.3505155  P(node) =0.2425
##     class counts:    68   126
##    probabilities: 0.351 0.649 
##   left son=26 (136 obs) right son=27 (58 obs)
##   Primary splits:
##       Property.RealEstate            < 0.5    to the left,  improve=4.281722, (0 missing)
##       Duration                       < 22.5   to the right, improve=3.588005, (0 missing)
##       Age                            < 25.5   to the left,  improve=3.343549, (0 missing)
##       CreditHistory.ThisBank.AllPaid < 0.5    to the right, improve=2.575549, (0 missing)
##       OtherDebtorsGuarantors.None    < 0.5    to the right, improve=2.533807, (0 missing)
##   Surrogate splits:
##       OtherDebtorsGuarantors.None        < 0.5    to the right, agree=0.768, adj=0.224, (0 split)
##       Amount                             < 632    to the right, agree=0.716, adj=0.052, (0 split)
##       Age                                < 20.5   to the right, agree=0.706, adj=0.017, (0 split)
##       OtherDebtorsGuarantors.CoApplicant < 0.5    to the left,  agree=0.706, adj=0.017, (0 split)
##       Job.UnskilledResident              < 0.5    to the left,  agree=0.706, adj=0.017, (0 split)
## 
## Node number 18: 129 observations,    complexity param=0.02401747
##   predicted class=0  expected loss=0.4728682  P(node) =0.16125
##     class counts:    68    61
##    probabilities: 0.527 0.473 
##   left son=36 (121 obs) right son=37 (8 obs)
##   Primary splits:
##       Purpose.Business          < 0.5    to the left,  improve=2.758425, (0 missing)
##       Amount                    < 1377   to the left,  improve=2.425020, (0 missing)
##       Purpose.NewCar            < 0.5    to the right, improve=2.289513, (0 missing)
##       InstallmentRatePercentage < 2.5    to the right, improve=1.865086, (0 missing)
##       Age                       < 30.5   to the right, improve=1.542534, (0 missing)
## 
## Node number 19: 11 observations
##   predicted class=1  expected loss=0  P(node) =0.01375
##     class counts:     0    11
##    probabilities: 0.000 1.000 
## 
## Node number 26: 136 observations,    complexity param=0.01965066
##   predicted class=1  expected loss=0.4191176  P(node) =0.17
##     class counts:    57    79
##    probabilities: 0.419 0.581 
##   left son=52 (31 obs) right son=53 (105 obs)
##   Primary splits:
##       Age                  < 25.5   to the left,  improve=4.103230, (0 missing)
##       Personal.Male.Single < 0.5    to the left,  improve=3.308824, (0 missing)
##       Purpose.NewCar       < 0.5    to the right, improve=3.045537, (0 missing)
##       Housing.Rent         < 0.5    to the right, improve=2.499899, (0 missing)
##       Amount               < 931.5  to the left,  improve=2.272952, (0 missing)
##   Surrogate splits:
##       OtherDebtorsGuarantors.None < 0.5    to the left,  agree=0.794, adj=0.097, (0 split)
##       Duration                    < 54     to the right, agree=0.779, adj=0.032, (0 split)
##       Amount                      < 546.5  to the left,  agree=0.779, adj=0.032, (0 split)
## 
## Node number 27: 58 observations,    complexity param=0.01310044
##   predicted class=1  expected loss=0.1896552  P(node) =0.0725
##     class counts:    11    47
##    probabilities: 0.190 0.810 
##   left son=54 (7 obs) right son=55 (51 obs)
##   Primary splits:
##       Duration                    < 22     to the right, improve=4.3822080, (0 missing)
##       Age                         < 31.5   to the left,  improve=2.1545090, (0 missing)
##       OtherDebtorsGuarantors.None < 0.5    to the right, improve=1.5894910, (0 missing)
##       Amount                      < 1221.5 to the right, improve=1.1907440, (0 missing)
##       Purpose.Furniture.Equipment < 0.5    to the right, improve=0.9088187, (0 missing)
##   Surrogate splits:
##       OtherInstallmentPlans.Stores     < 0.5    to the right, agree=0.914, adj=0.286, (0 split)
##       Personal.Male.Divorced.Seperated < 0.5    to the right, agree=0.897, adj=0.143, (0 split)
## 
## Node number 36: 121 observations,    complexity param=0.02401747
##   predicted class=0  expected loss=0.446281  P(node) =0.15125
##     class counts:    67    54
##    probabilities: 0.554 0.446 
##   left son=72 (82 obs) right son=73 (39 obs)
##   Primary splits:
##       InstallmentRatePercentage   < 2.5    to the right, improve=2.368882, (0 missing)
##       Purpose.Furniture.Equipment < 0.5    to the left,  improve=2.368882, (0 missing)
##       Amount                      < 1577.5 to the left,  improve=2.144262, (0 missing)
##       Purpose.NewCar              < 0.5    to the right, improve=1.585437, (0 missing)
##       OtherDebtorsGuarantors.None < 0.5    to the right, improve=1.149010, (0 missing)
##   Surrogate splits:
##       Amount                           < 3571   to the left,  agree=0.744, adj=0.205, (0 split)
##       Personal.Male.Divorced.Seperated < 0.5    to the left,  agree=0.744, adj=0.205, (0 split)
##       Duration                         < 29     to the left,  agree=0.686, adj=0.026, (0 split)
##       NumberExistingCredits            < 2.5    to the left,  agree=0.686, adj=0.026, (0 split)
##       Purpose.Furniture.Equipment      < 0.5    to the left,  agree=0.686, adj=0.026, (0 split)
## 
## Node number 37: 8 observations
##   predicted class=1  expected loss=0.125  P(node) =0.01
##     class counts:     1     7
##    probabilities: 0.125 0.875 
## 
## Node number 52: 31 observations
##   predicted class=0  expected loss=0.3548387  P(node) =0.03875
##     class counts:    20    11
##    probabilities: 0.645 0.355 
## 
## Node number 53: 105 observations,    complexity param=0.01310044
##   predicted class=1  expected loss=0.352381  P(node) =0.13125
##     class counts:    37    68
##    probabilities: 0.352 0.648 
##   left son=106 (52 obs) right son=107 (53 obs)
##   Primary splits:
##       SavingsAccountBonds.lt.100 < 0.5    to the right, improve=2.455014, (0 missing)
##       Age                        < 48.5   to the right, improve=1.981837, (0 missing)
##       Amount                     < 931.5  to the left,  improve=1.964626, (0 missing)
##       Housing.Own                < 0.5    to the left,  improve=1.689451, (0 missing)
##       Personal.Male.Single       < 0.5    to the left,  improve=1.658566, (0 missing)
##   Surrogate splits:
##       SavingsAccountBonds.100.to.500 < 0.5    to the left,  agree=0.724, adj=0.442, (0 split)
##       Job.SkilledEmployee            < 0.5    to the left,  agree=0.610, adj=0.212, (0 split)
##       Age                            < 31.5   to the right, agree=0.600, adj=0.192, (0 split)
##       Telephone                      < 0.5    to the left,  agree=0.600, adj=0.192, (0 split)
##       Purpose.Furniture.Equipment    < 0.5    to the right, agree=0.571, adj=0.135, (0 split)
## 
## Node number 54: 7 observations
##   predicted class=0  expected loss=0.2857143  P(node) =0.00875
##     class counts:     5     2
##    probabilities: 0.714 0.286 
## 
## Node number 55: 51 observations
##   predicted class=1  expected loss=0.1176471  P(node) =0.06375
##     class counts:     6    45
##    probabilities: 0.118 0.882 
## 
## Node number 72: 82 observations,    complexity param=0.01310044
##   predicted class=0  expected loss=0.3780488  P(node) =0.1025
##     class counts:    51    31
##    probabilities: 0.622 0.378 
##   left son=144 (40 obs) right son=145 (42 obs)
##   Primary splits:
##       Amount                   < 1577.5 to the left,  improve=1.658595, (0 missing)
##       Telephone                < 0.5    to the right, improve=1.449397, (0 missing)
##       Purpose.NewCar           < 0.5    to the right, improve=1.370800, (0 missing)
##       Purpose.Radio.Television < 0.5    to the left,  improve=1.132404, (0 missing)
##       Age                      < 55     to the left,  improve=1.081246, (0 missing)
##   Surrogate splits:
##       Purpose.Furniture.Equipment < 0.5    to the left,  agree=0.646, adj=0.275, (0 split)
##       Duration                    < 16.5   to the left,  agree=0.634, adj=0.250, (0 split)
##       InstallmentRatePercentage   < 3.5    to the right, agree=0.622, adj=0.225, (0 split)
##       Telephone                   < 0.5    to the right, agree=0.622, adj=0.225, (0 split)
##       Personal.Male.Single        < 0.5    to the left,  agree=0.622, adj=0.225, (0 split)
## 
## Node number 73: 39 observations,    complexity param=0.01310044
##   predicted class=1  expected loss=0.4102564  P(node) =0.04875
##     class counts:    16    23
##    probabilities: 0.410 0.590 
##   left son=146 (26 obs) right son=147 (13 obs)
##   Primary splits:
##       Duration                < 15.5   to the right, improve=2.564103, (0 missing)
##       Telephone               < 0.5    to the left,  improve=1.538462, (0 missing)
##       EmploymentDuration.lt.1 < 0.5    to the right, improve=1.538462, (0 missing)
##       Age                     < 30.5   to the right, improve=1.257664, (0 missing)
##       Amount                  < 1961.5 to the right, improve=1.189036, (0 missing)
##   Surrogate splits:
##       Amount < 1828.5 to the right, agree=0.846, adj=0.538, (0 split)
##       Age    < 35     to the left,  agree=0.692, adj=0.077, (0 split)
## 
## Node number 106: 52 observations,    complexity param=0.01310044
##   predicted class=1  expected loss=0.4615385  P(node) =0.065
##     class counts:    24    28
##    probabilities: 0.462 0.538 
##   left son=212 (32 obs) right son=213 (20 obs)
##   Primary splits:
##       NumberExistingCredits   < 1.5    to the left,  improve=2.908654, (0 missing)
##       Duration                < 28.5   to the right, improve=2.447658, (0 missing)
##       Age                     < 35.5   to the right, improve=1.846154, (0 missing)
##       ResidenceDuration       < 1.5    to the right, improve=1.246671, (0 missing)
##       EmploymentDuration.gt.7 < 0.5    to the right, improve=1.231775, (0 missing)
##   Surrogate splits:
##       Age                          < 27.5   to the right, agree=0.673, adj=0.15, (0 split)
##       InstallmentRatePercentage    < 1.5    to the right, agree=0.654, adj=0.10, (0 split)
##       CreditHistory.PaidDuly       < 0.5    to the right, agree=0.654, adj=0.10, (0 split)
##       OtherInstallmentPlans.Stores < 0.5    to the left,  agree=0.654, adj=0.10, (0 split)
##       Job.UnemployedUnskilled      < 0.5    to the left,  agree=0.654, adj=0.10, (0 split)
## 
## Node number 107: 53 observations,    complexity param=0.01091703
##   predicted class=1  expected loss=0.245283  P(node) =0.06625
##     class counts:    13    40
##    probabilities: 0.245 0.755 
##   left son=214 (24 obs) right son=215 (29 obs)
##   Primary splits:
##       SavingsAccountBonds.100.to.500 < 0.5    to the right, improve=2.576664, (0 missing)
##       Amount                         < 1930   to the left,  improve=1.693587, (0 missing)
##       EmploymentDuration.lt.1        < 0.5    to the right, improve=1.611103, (0 missing)
##       EmploymentDuration.1.to.4      < 0.5    to the left,  improve=1.334922, (0 missing)
##       NumberExistingCredits          < 1.5    to the right, improve=1.280380, (0 missing)
##   Surrogate splits:
##       Purpose.NewCar            < 0.5    to the right, agree=0.679, adj=0.292, (0 split)
##       EmploymentDuration.1.to.4 < 0.5    to the left,  agree=0.660, adj=0.250, (0 split)
##       EmploymentDuration.lt.1   < 0.5    to the right, agree=0.642, adj=0.208, (0 split)
##       Age                       < 31.5   to the left,  agree=0.604, adj=0.125, (0 split)
##       InstallmentRatePercentage < 1.5    to the left,  agree=0.585, adj=0.083, (0 split)
## 
## Node number 144: 40 observations
##   predicted class=0  expected loss=0.275  P(node) =0.05
##     class counts:    29    11
##    probabilities: 0.725 0.275 
## 
## Node number 145: 42 observations,    complexity param=0.01310044
##   predicted class=0  expected loss=0.4761905  P(node) =0.0525
##     class counts:    22    20
##    probabilities: 0.524 0.476 
##   left son=290 (30 obs) right son=291 (12 obs)
##   Primary splits:
##       Amount                     < 2135.5 to the right, improve=2.5190480, (0 missing)
##       SavingsAccountBonds.lt.100 < 0.5    to the right, improve=2.0836940, (0 missing)
##       Housing.Own                < 0.5    to the left,  improve=1.2857140, (0 missing)
##       Age                        < 24.5   to the right, improve=0.9523810, (0 missing)
##       NumberExistingCredits      < 1.5    to the right, improve=0.6857143, (0 missing)
##   Surrogate splits:
##       ForeignWorker < 0.5    to the right, agree=0.738, adj=0.083, (0 split)
## 
## Node number 146: 26 observations,    complexity param=0.01310044
##   predicted class=0  expected loss=0.4615385  P(node) =0.0325
##     class counts:    14    12
##    probabilities: 0.538 0.462 
##   left son=292 (12 obs) right son=293 (14 obs)
##   Primary splits:
##       Amount                  < 3506.5 to the left,  improve=1.9945050, (0 missing)
##       Duration                < 19     to the left,  improve=1.3594410, (0 missing)
##       EmploymentDuration.lt.1 < 0.5    to the right, improve=1.0341880, (0 missing)
##       Age                     < 30.5   to the right, improve=0.8480769, (0 missing)
##       Property.CarOther       < 0.5    to the left,  improve=0.6175214, (0 missing)
##   Surrogate splits:
##       Age                      < 26.5   to the left,  agree=0.731, adj=0.417, (0 split)
##       Duration                 < 19     to the left,  agree=0.654, adj=0.250, (0 split)
##       Telephone                < 0.5    to the right, agree=0.654, adj=0.250, (0 split)
##       Purpose.Radio.Television < 0.5    to the right, agree=0.654, adj=0.250, (0 split)
##       Housing.Rent             < 0.5    to the right, agree=0.654, adj=0.250, (0 split)
## 
## Node number 147: 13 observations
##   predicted class=1  expected loss=0.1538462  P(node) =0.01625
##     class counts:     2    11
##    probabilities: 0.154 0.846 
## 
## Node number 212: 32 observations,    complexity param=0.01310044
##   predicted class=0  expected loss=0.40625  P(node) =0.04
##     class counts:    19    13
##    probabilities: 0.594 0.406 
##   left son=424 (22 obs) right son=425 (10 obs)
##   Primary splits:
##       Age                     < 32.5   to the right, improve=2.5102270, (0 missing)
##       Duration                < 11.5   to the right, improve=1.7003570, (0 missing)
##       CreditHistory.PaidDuly  < 0.5    to the left,  improve=1.1002450, (0 missing)
##       Amount                  < 1383   to the left,  improve=0.8481280, (0 missing)
##       EmploymentDuration.gt.7 < 0.5    to the right, improve=0.5976732, (0 missing)
##   Surrogate splits:
##       EmploymentDuration.lt.1 < 0.5    to the left,  agree=0.812, adj=0.4, (0 split)
##       Duration                < 7.5    to the right, agree=0.719, adj=0.1, (0 split)
##       CreditHistory.PaidDuly  < 0.5    to the left,  agree=0.719, adj=0.1, (0 split)
## 
## Node number 213: 20 observations
##   predicted class=1  expected loss=0.25  P(node) =0.025
##     class counts:     5    15
##    probabilities: 0.250 0.750 
## 
## Node number 214: 24 observations,    complexity param=0.01091703
##   predicted class=1  expected loss=0.4166667  P(node) =0.03
##     class counts:    10    14
##    probabilities: 0.417 0.583 
##   left son=428 (7 obs) right son=429 (17 obs)
##   Primary splits:
##       Job.SkilledEmployee   < 0.5    to the left,  improve=3.8347340, (0 missing)
##       Amount                < 2814.5 to the left,  improve=1.9603730, (0 missing)
##       Age                   < 31.5   to the right, improve=1.1523810, (0 missing)
##       Property.CarOther     < 0.5    to the left,  improve=1.1523810, (0 missing)
##       NumberExistingCredits < 1.5    to the right, improve=0.6736597, (0 missing)
##   Surrogate splits:
##       Job.UnskilledResident        < 0.5    to the right, agree=0.875, adj=0.571, (0 split)
##       Amount                       < 6499   to the right, agree=0.833, adj=0.429, (0 split)
##       InstallmentRatePercentage    < 1.5    to the left,  agree=0.792, adj=0.286, (0 split)
##       Property.Insurance           < 0.5    to the right, agree=0.792, adj=0.286, (0 split)
##       OtherInstallmentPlans.Stores < 0.5    to the right, agree=0.792, adj=0.286, (0 split)
## 
## Node number 215: 29 observations
##   predicted class=1  expected loss=0.1034483  P(node) =0.03625
##     class counts:     3    26
##    probabilities: 0.103 0.897 
## 
## Node number 290: 30 observations,    complexity param=0.01310044
##   predicted class=0  expected loss=0.3666667  P(node) =0.0375
##     class counts:    19    11
##    probabilities: 0.633 0.367 
##   left son=580 (22 obs) right son=581 (8 obs)
##   Primary splits:
##       SavingsAccountBonds.lt.100 < 0.5    to the right, improve=3.206061, (0 missing)
##       Purpose.Radio.Television   < 0.5    to the left,  improve=1.456061, (0 missing)
##       Age                        < 28.5   to the right, improve=1.354148, (0 missing)
##       Duration                   < 15     to the left,  improve=1.274242, (0 missing)
##       NumberExistingCredits      < 1.5    to the right, improve=1.274242, (0 missing)
##   Surrogate splits:
##       SavingsAccountBonds.500.to.1000 < 0.5    to the left,  agree=0.833, adj=0.375, (0 split)
##       SavingsAccountBonds.gt.1000     < 0.5    to the left,  agree=0.833, adj=0.375, (0 split)
##       Age                             < 22.5   to the right, agree=0.800, adj=0.250, (0 split)
##       OtherInstallmentPlans.Stores    < 0.5    to the left,  agree=0.800, adj=0.250, (0 split)
##       Duration                        < 29     to the left,  agree=0.767, adj=0.125, (0 split)
## 
## Node number 291: 12 observations
##   predicted class=1  expected loss=0.25  P(node) =0.015
##     class counts:     3     9
##    probabilities: 0.250 0.750 
## 
## Node number 292: 12 observations
##   predicted class=0  expected loss=0.25  P(node) =0.015
##     class counts:     9     3
##    probabilities: 0.750 0.250 
## 
## Node number 293: 14 observations
##   predicted class=1  expected loss=0.3571429  P(node) =0.0175
##     class counts:     5     9
##    probabilities: 0.357 0.643 
## 
## Node number 424: 22 observations
##   predicted class=0  expected loss=0.2727273  P(node) =0.0275
##     class counts:    16     6
##    probabilities: 0.727 0.273 
## 
## Node number 425: 10 observations
##   predicted class=1  expected loss=0.3  P(node) =0.0125
##     class counts:     3     7
##    probabilities: 0.300 0.700 
## 
## Node number 428: 7 observations
##   predicted class=0  expected loss=0.1428571  P(node) =0.00875
##     class counts:     6     1
##    probabilities: 0.857 0.143 
## 
## Node number 429: 17 observations
##   predicted class=1  expected loss=0.2352941  P(node) =0.02125
##     class counts:     4    13
##    probabilities: 0.235 0.765 
## 
## Node number 580: 22 observations
##   predicted class=0  expected loss=0.2272727  P(node) =0.0275
##     class counts:    17     5
##    probabilities: 0.773 0.227 
## 
## Node number 581: 8 observations
##   predicted class=1  expected loss=0.25  P(node) =0.01
##     class counts:     2     6
##    probabilities: 0.250 0.750
printcp(tree.model)
## 
## Classification tree:
## rpart(formula = Class ~ ., data = train.df, method = "class")
## 
## Variables actually used in tree construction:
##  [1] Age                            Amount                        
##  [3] CheckingAccountStatus.0.to.200 CheckingAccountStatus.lt.0    
##  [5] Duration                       InstallmentRatePercentage     
##  [7] Job.SkilledEmployee            NumberExistingCredits         
##  [9] Property.RealEstate            Purpose.Business              
## [11] Purpose.UsedCar                SavingsAccountBonds.100.to.500
## [13] SavingsAccountBonds.lt.100    
## 
## Root node error: 229/800 = 0.28625
## 
## n= 800 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.030568      0   1.00000 1.0000 0.055828
## 2 0.024017      4   0.87773 1.0131 0.056045
## 3 0.019651      8   0.77293 1.0044 0.055901
## 4 0.013100     10   0.73362 1.0175 0.056116
## 5 0.010917     19   0.60699 1.0131 0.056045
## 6 0.010000     21   0.58515 1.0087 0.055973

Your Answer: From the complexity parameter table, the relative training error (rel error) falls from 1.000 to 0.585 as the number of splits grows from 0 to 21, indicating a progressively better fit to the training data. The cross-validated error (xerror), however, never drops below its root-node value of 1.0 (it ranges from 1.0000 to 1.0175), so the added complexity does not improve estimated predictive performance and points to overfitting.

The variable importance results show that Amount and Duration are the most influential predictors in determining the classification outcome.
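
The reading of the complexity table can be checked with a few lines of arithmetic. The sketch below re-enters the printed table by hand (the data frame name `cp.tab` is hypothetical, not an object from the original code) and applies two common selection rules: the minimum cross-validated error and the one-standard-error rule. With a fitted model, the same columns should be available in `tree.model$cptable`.

```r
# CP table copied from the printcp(tree.model) output above
cp.tab <- data.frame(
  CP        = c(0.030568, 0.024017, 0.019651, 0.013100, 0.010917, 0.010000),
  nsplit    = c(0, 4, 8, 10, 19, 21),
  rel.error = c(1.00000, 0.87773, 0.77293, 0.73362, 0.60699, 0.58515),
  xerror    = c(1.0000, 1.0131, 1.0044, 1.0175, 1.0131, 1.0087),
  xstd      = c(0.055828, 0.056045, 0.055901, 0.056116, 0.056045, 0.055973)
)

# Rule 1: row with the smallest cross-validated error
best <- which.min(cp.tab$xerror)

# Rule 2 (one-standard-error rule): simplest tree whose xerror lies
# within one xstd of the minimum
threshold <- cp.tab$xerror[best] + cp.tab$xstd[best]
one.se <- min(which(cp.tab$xerror <= threshold))

cp.tab[unique(c(best, one.se)), c("CP", "nsplit", "xerror")]

# Convert relative errors to absolute rates with the root node error
root.err <- 229 / 800
cp.tab$train.err <- cp.tab$rel.error * root.err
cp.tab$cv.err    <- cp.tab$xerror * root.err
```

Both rules select row 1 (zero splits), because no larger tree has a lower xerror than the root. The last row also gives a useful cross-check: 0.58515 times the root node error 0.28625 is about 0.1675, the training error rate of the full 21-split tree.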

4. Visualize the tree: (5pts)

rpart.plot(tree.model)

5. Use the training set to get predicted classes. (5pts)

train.pred <- predict(tree.model, train.df, type = "class")
head(train.pred)
## 578 549 557 700 255 913 
##   1   0   1   1   0   0 
## Levels: 0 1

6. Obtain confusion matrix and MR on training set. (5pts)

table(Predicted = train.pred, Actual = train.df$Class)
##          Actual
## Predicted   0   1
##         0 145  50
##         1  84 521
mean(train.pred != train.df$Class)
## [1] 0.1675

Your Answer: The confusion matrix indicates that the model correctly classified 145 observations in class 0 and 521 in class 1. It misclassified 84 actual class 0 observations as class 1 and 50 actual class 1 observations as class 0. The misclassification rate (MR) is (84 + 50) / 800 = 0.1675, meaning 16.75% of the training observations were incorrectly classified.
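
The overall rate hides a large difference between the two classes. As a minimal sketch, the per-class error rates can be recomputed from the printed counts; the object names `train.cm`, `mr`, and `class.err` are illustrative and not part of the original code.

```r
# Training confusion matrix copied from the output above
# (rows = predicted class, columns = actual class; matrix() fills by column)
train.cm <- matrix(c(145, 84, 50, 521), nrow = 2,
                   dimnames = list(Predicted = c("0", "1"),
                                   Actual    = c("0", "1")))

# Overall misclassification rate: share of off-diagonal counts
mr <- 1 - sum(diag(train.cm)) / sum(train.cm)

# Share of each actual class that was predicted wrongly
class.err <- 1 - diag(train.cm) / colSums(train.cm)
names(class.err) <- c("actual 0", "actual 1")

round(c(MR = mr, class.err), 4)
```

About 36.7% of the actual class 0 (Bad) observations are misclassified (84 of 229), against about 8.8% of the actual class 1 (Good) observations (50 of 571), so the tree is considerably weaker at identifying bad credit risks.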

7. Use the testing set to get predicted classes. (5pts)

test.pred <- predict(tree.model, test.df, type = "class")

8. Obtain confusion matrix and MR on testing set. (5pts)

table(Predicted = test.pred, Actual = test.df$Class)
##          Actual
## Predicted   0   1
##         0  36  26
##         1  35 103
mean(test.pred != test.df$Class)
## [1] 0.305

Your Answer: The confusion matrix indicates that the model correctly classified 36 observations in class 0 and 103 in class 1. It misclassified 35 actual class 0 observations as class 1 and 26 actual class 1 observations as class 0. The misclassification rate (MR) is (35 + 26) / 200 = 0.305, so 30.5% of the testing observations were incorrectly classified. This is nearly double the training MR of 0.1675, which indicates that the tree overfits the training data.
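
One way to put the two misclassification rates in context is to compare them with a rule that always predicts the majority class. The sketch below uses only the counts printed in the two confusion matrices; the object names are illustrative.

```r
# Actual class totals are the column sums of the two confusion matrices above
train.actual <- c("0" = 145 + 84, "1" = 50 + 521)   # 229 and 571
test.actual  <- c("0" = 36 + 35,  "1" = 26 + 103)   # 71 and 129

# Error rate of a rule that always predicts the majority class (class 1)
baseline <- c(train = min(train.actual) / sum(train.actual),
              test  = min(test.actual)  / sum(test.actual))

# Misclassification rates reported for the tree
tree.mr <- c(train = 0.1675, test = 0.305)

rbind(baseline, tree.mr, gain = baseline - tree.mr)
```

The baseline error is 0.28625 on the training set and 0.355 on the testing set, so the tree improves on it by about 11.9 percentage points in training but only 5 points on unseen data.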

9. Obtain the ROC curve and AUC for the testing data (not training). (5pts)

library(pROC)
## Warning: package 'pROC' was built under R version 4.4.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
test.prob <- predict(tree.model, test.df, type = "prob")[,2]

roc.obj <- roc(test.df$Class, test.prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc.obj)

auc(roc.obj)
## Area under the curve: 0.6743
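
The AUC can be read as the probability that a randomly chosen class 1 case receives a higher predicted probability than a randomly chosen class 0 case, with ties counted as one half. The following sketch computes it from ranks (the Mann-Whitney identity) in base R. The function name `auc_by_rank` and the toy vectors are hypothetical and are not taken from the GermanCredit results.

```r
# AUC via the rank-sum (Mann-Whitney) identity. Tied scores get average
# ranks, which corresponds to counting each tie as one half.
auc_by_rank <- function(labels, scores) {
  pos <- labels == 1                 # also works for a factor with levels "0"/"1"
  n1 <- sum(pos)                     # number of cases (class 1)
  n0 <- sum(!pos)                    # number of controls (class 0)
  r <- rank(scores)                  # average ranks for ties
  (sum(r[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical toy data: 3 of the 4 case/control pairs are ordered correctly
toy.labels <- c(0, 0, 1, 1)
toy.scores <- c(0.1, 0.4, 0.35, 0.8)
auc_by_rank(toy.labels, toy.scores)   # 0.75
```

Applied as `auc_by_rank(test.df$Class, test.prob)`, this should agree with the pROC value up to rounding. A classification tree produces only as many distinct probabilities as it has leaves, so ties are frequent and the ROC curve consists of a few straight segments.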

10. (Optional) Use cp or other parameters to prune the tree and see whether you can get a better testing MR and testing AUC.
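
A minimal sketch of cost-complexity pruning follows. The helper `prune_min_xerror` is a hypothetical name, and the demonstration uses the `kyphosis` data shipped with rpart as a stand-in so that the block runs on its own; it is not the GermanCredit model from this report.

```r
library(rpart)

# Prune an rpart fit back to the cp value with the lowest cross-validated error
prune_min_xerror <- function(fit) {
  cp.tab <- fit$cptable
  best.cp <- cp.tab[which.min(cp.tab[, "xerror"]), "CP"]
  prune(fit, cp = best.cp)
}

# Stand-in demonstration (assumption: kyphosis replaces the credit data)
set.seed(1)  # the cross-validation folds inside rpart() are random
big.fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class", cp = 0)
small.fit <- prune_min_xerror(big.fit)

# Compare the number of leaves before and after pruning
c(leaves.before = sum(big.fit$frame$var == "<leaf>"),
  leaves.after  = sum(small.fit$frame$var == "<leaf>"))
```

For the credit tree, the call would be `prune_min_xerror(tree.model)`, followed by steps 7 to 9 on the pruned model. Note that in the CP table printed earlier the minimum xerror sits at zero splits, so this particular rule would collapse the tree to its root. An intermediate value such as `prune(tree.model, cp = 0.02)`, which corresponds to the 8-split row, may be a more informative candidate to compare on the testing set.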

Part 2, Regression Tree Method on mtcars data (50pts in total)

Starter code for mtcars dataset

We will use the built-in mtcars dataset to predict miles per gallon (mpg) using other car characteristics. The dataset includes information about 32 cars from Motor Trend magazine (1973-74).

0. Load the data (5pts)

# Load the mtcars dataset
data(mtcars)
# Display the structure of the dataset
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

1. Split the dataset into training and test sets with an 80-20 split. Use set.seed(2020) for reproducibility. (5pts)

set.seed(2020)

# 0.8 * 32 = 25.6; sample() truncates the size, so 25 rows go to training and 7 to testing
train_index <- sample(1:nrow(mtcars), 0.8 * nrow(mtcars))

train.df <- mtcars[train_index, ]
test.df  <- mtcars[-train_index, ]

2. Fit a basic regression tree model using the training set with mpg as the response variable. Set method = “anova”. (10pts)

tree.model <- rpart(mpg ~ ., data = train.df, method = "anova")

2. Visualize the tree using rpart.plot. Interpret the splits. (10pts)

rpart.plot(tree.model)

Your Answer: The tree has a single split, on the variable cyl.

Cars with 5 or more cylinders (cyl ≥ 5) fall into one group with a lower predicted mpg value of approximately 17, while cars with fewer than 5 cylinders (cyl < 5) fall into another group with a higher predicted mpg value of approximately 27.

This indicates that the number of cylinders is the most important factor in determining fuel efficiency in this model, with cars having fewer cylinders generally achieving better gas mileage.
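
A regression tree predicts the mean training response within each leaf, so the two leaf values can be checked directly from the training data. The sketch below rebuilds the split from step 1 and takes the cyl threshold of 5 from the interpretation above; the name `leaf.means` is illustrative.

```r
# Rebuild the training split exactly as in step 1
data(mtcars)
set.seed(2020)
train_index <- sample(1:nrow(mtcars), 0.8 * nrow(mtcars))
train.df <- mtcars[train_index, ]

# Mean mpg on each side of the cyl split reported for the tree
leaf.means <- c(
  "cyl < 5"  = mean(train.df$mpg[train.df$cyl < 5]),
  "cyl >= 5" = mean(train.df$mpg[train.df$cyl >= 5])
)
leaf.means

# Number of training cars falling into each leaf
table(cyl.ge.5 = train.df$cyl >= 5)
```

The single split is also a consequence of the sample size. With rpart's defaults (minsplit = 20, minbucket = 7) and 25 training rows, each child of the root holds between 7 and 18 cars, which is below minsplit, so neither child can be split again.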

3. Make predictions and calculate MSE and R-squared on training set. (10pts)

train.pred <- predict(tree.model, train.df)
mean((train.df$mpg - train.pred)^2)
## [1] 12.79054
1 - sum((train.df$mpg - train.pred)^2) / sum((train.df$mpg - mean(train.df$mpg))^2)
## [1] 0.6145199

Your Answer: The Mean Squared Error (MSE) is 12.79, the average squared difference between the predicted and actual mpg values on the training set. The R-squared value is 0.615, meaning that approximately 61.5% of the variability in mpg is explained by the model.

4. Make predictions and calculate MSE and R-squared on testing set. (10pts)

test.pred <- predict(tree.model, test.df)

mean((test.df$mpg - test.pred)^2)
## [1] 12.31011
1 - sum((test.df$mpg - test.pred)^2) / sum((test.df$mpg - mean(test.df$mpg))^2)
## [1] 0.7085027

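Your Answer: The Mean Squared Error (MSE) is 12.31, the average squared difference between predicted and actual mpg values. The R-squared value is 0.709, meaning that approximately 70.9% of the variability in mpg is explained by the model on the testing set. Because the testing set contains only 7 cars (32 minus the 25 used for training), these estimates are noisy, and the slightly better test results should not be read as evidence that the model generalizes better than it fits.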
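
The two formulas used in steps 3 and 4 can be wrapped in one small helper so that the training and testing calculations are guaranteed to match. This is a sketch; `reg_metrics` is a hypothetical name and the toy vectors are made up so the result can be verified by hand.

```r
# Helper returning MSE and R-squared for numeric vectors of equal length
reg_metrics <- function(actual, pred) {
  sse <- sum((actual - pred)^2)            # residual sum of squares
  sst <- sum((actual - mean(actual))^2)    # total sum of squares around the mean
  c(MSE = sse / length(actual), R2 = 1 - sse / sst)
}

# Hypothetical toy values: SSE = 1 and SST = 5
reg_metrics(actual = c(1, 2, 3, 4), pred = c(1, 2, 3, 5))
# MSE = 0.25, R2 = 0.8
```

In this report the calls would be `reg_metrics(train.df$mpg, train.pred)` and `reg_metrics(test.df$mpg, test.pred)`. Note that R-squared is measured against the mean of whichever set is passed in, which is what the original code does as well.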

Part 3 (Optional): Recall the results from the previous homework. How do they compare with the results here? Just discuss.