Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data is
provided in that package.
install.packages('caret')
library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to numeric: 1 = 'Good', 0 = 'Bad'
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : num 1 0 1 1 0 1 1 1 1 0 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: We can see that Class is now a numeric
variable, where 1 = ‘Good’ and 0 = ‘Bad’. From
CheckingAccountStatus.lt.0 through
Job.Management.SelfEmp.HighlyQualified, all variables are
0/1 dummy variables. From Duration through
NumberPeopleMaintenance the variables are stored as int;
some are truly numerical while others are categorical codes.
We know that InstallmentRatePercentage and
ResidenceDuration are categorical, so we convert them
to factors.
GermanCredit$InstallmentRatePercentage<- as.factor(GermanCredit$InstallmentRatePercentage)
GermanCredit$ResidenceDuration<- as.factor(GermanCredit$ResidenceDuration)
#Optional: drop variables that provide no information (constant or redundant dummy columns)
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
Your observation: We have dropped all the variables that provide no information.
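As an alternative to hard-coding column positions (the indices above are specific to this version of the data), caret's nearZeroVar() can flag (near-)constant columns automatically. A hedged sketch; note its result may differ slightly from the hand-picked set above, which also removes one redundant dummy per factor group.
nzv_cols <- nearZeroVar(GermanCredit)   # indices of (near-)zero-variance columns
names(GermanCredit)[nzv_cols]           # inspect before dropping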
dim(GermanCredit)
## [1] 1000 49
names(GermanCredit)
## [1] "Duration" "Amount"
## [3] "InstallmentRatePercentage" "ResidenceDuration"
## [5] "Age" "NumberExistingCredits"
## [7] "NumberPeopleMaintenance" "Telephone"
## [9] "ForeignWorker" "Class"
## [11] "CheckingAccountStatus.lt.0" "CheckingAccountStatus.0.to.200"
## [13] "CheckingAccountStatus.gt.200" "CreditHistory.NoCredit.AllPaid"
## [15] "CreditHistory.ThisBank.AllPaid" "CreditHistory.PaidDuly"
## [17] "CreditHistory.Delay" "Purpose.NewCar"
## [19] "Purpose.UsedCar" "Purpose.Furniture.Equipment"
## [21] "Purpose.Radio.Television" "Purpose.DomesticAppliance"
## [23] "Purpose.Repairs" "Purpose.Education"
## [25] "Purpose.Retraining" "Purpose.Business"
## [27] "SavingsAccountBonds.lt.100" "SavingsAccountBonds.100.to.500"
## [29] "SavingsAccountBonds.500.to.1000" "SavingsAccountBonds.gt.1000"
## [31] "EmploymentDuration.lt.1" "EmploymentDuration.1.to.4"
## [33] "EmploymentDuration.4.to.7" "EmploymentDuration.gt.7"
## [35] "Personal.Male.Divorced.Seperated" "Personal.Female.NotSingle"
## [37] "Personal.Male.Single" "OtherDebtorsGuarantors.None"
## [39] "OtherDebtorsGuarantors.CoApplicant" "Property.RealEstate"
## [41] "Property.Insurance" "Property.CarOther"
## [43] "OtherInstallmentPlans.Bank" "OtherInstallmentPlans.Stores"
## [45] "Housing.Rent" "Housing.Own"
## [47] "Job.UnemployedUnskilled" "Job.UnskilledResident"
## [49] "Job.SkilledEmployee"
summary(GermanCredit)
## Duration Amount InstallmentRatePercentage ResidenceDuration
## Min. : 4.0 Min. : 250 1:136 1:130
## 1st Qu.:12.0 1st Qu.: 1366 2:231 2:308
## Median :18.0 Median : 2320 3:157 3:149
## Mean :20.9 Mean : 3271 4:476 4:413
## 3rd Qu.:24.0 3rd Qu.: 3972
## Max. :72.0 Max. :18424
## Age NumberExistingCredits NumberPeopleMaintenance Telephone
## Min. :19.00 Min. :1.000 Min. :1.000 Min. :0.000
## 1st Qu.:27.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000
## Median :33.00 Median :1.000 Median :1.000 Median :1.000
## Mean :35.55 Mean :1.407 Mean :1.155 Mean :0.596
## 3rd Qu.:42.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :75.00 Max. :4.000 Max. :2.000 Max. :1.000
## ForeignWorker Class CheckingAccountStatus.lt.0
## Min. :0.000 Min. :0.0 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.0 1st Qu.:0.000
## Median :1.000 Median :1.0 Median :0.000
## Mean :0.963 Mean :0.7 Mean :0.274
## 3rd Qu.:1.000 3rd Qu.:1.0 3rd Qu.:1.000
## Max. :1.000 Max. :1.0 Max. :1.000
## CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.269 Mean :0.063
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
## Min. :0.00 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000
## Median :0.00 Median :0.000
## Mean :0.04 Mean :0.049
## 3rd Qu.:0.00 3rd Qu.:0.000
## Max. :1.00 Max. :1.000
## CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar Purpose.UsedCar
## Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :1.00 Median :0.000 Median :0.000 Median :0.000
## Mean :0.53 Mean :0.088 Mean :0.234 Mean :0.103
## 3rd Qu.:1.00 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.00 Max. :1.000 Max. :1.000 Max. :1.000
## Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
## Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0.000
## Mean :0.181 Mean :0.28 Mean :0.012
## 3rd Qu.:0.000 3rd Qu.:1.00 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.000
## Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0.000 Median :0.000
## Mean :0.022 Mean :0.05 Mean :0.009 Mean :0.097
## 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.000 Max. :1.000
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :0.000
## Mean :0.603 Mean :0.103
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.063 Mean :0.048
## 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.172 Mean :0.339 Mean :0.174
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:0.00
## Median :0.000 Median :0.00
## Mean :0.253 Mean :0.05
## 3rd Qu.:1.000 3rd Qu.:0.00
## Max. :1.000 Max. :1.00
## Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:1.000
## Median :0.00 Median :1.000 Median :1.000
## Mean :0.31 Mean :0.548 Mean :0.907
## 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.00 Max. :1.000 Max. :1.000
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.041 Mean :0.282 Mean :0.232
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.332 Mean :0.139 Mean :0.047
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0
## Median :0.000 Median :1.000 Median :0.000 Median :0.0
## Mean :0.179 Mean :0.713 Mean :0.022 Mean :0.2
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.0
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.0
## Job.SkilledEmployee
## Min. :0.00
## 1st Qu.:0.00
## Median :1.00
## Mean :0.63
## 3rd Qu.:1.00
## Max. :1.00
Your observation: Now, there are 1000 observations
and 49 variables instead of 62. As we saw before, most variables are
dummy variables, so their Max is 1 and their Min is 0. We can
observe that Duration, Amount and
Age are continuous variables, while most of the others are
categorical. Class itself is now a numeric 0/1
variable.
Now we are going to plot all the variables to see if there are any outliers.
boxplot(GermanCredit, las=2, cex.axis=0.6)
Your observation: From the boxplots, we can see that
most variables are on a 0/1 scale, so outliers are only a concern
for the continuous variables. Only Amount might have some
outliers, so we examine it on its own.
boxplot(GermanCredit$Amount)
It looks like this variable has outliers. We will not do any truncation
or winsorization because these extreme values might be important for the
analysis.
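For reference only (we do not apply it), winsorizing Amount at, say, the 1st and 99th percentiles would look like the sketch below; the cutoffs are illustrative assumptions, not values from the analysis.
q <- quantile(GermanCredit$Amount, probs = c(0.01, 0.99))  # hypothetical cutoffs
amount_wins <- pmin(pmax(GermanCredit$Amount, q[1]), q[2]) # clamp the tails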
Let's look for missing values.
colSums(is.na(GermanCredit))
## Duration Amount
## 0 0
## InstallmentRatePercentage ResidenceDuration
## 0 0
## Age NumberExistingCredits
## 0 0
## NumberPeopleMaintenance Telephone
## 0 0
## ForeignWorker Class
## 0 0
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 0 0
## CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 0 0
## CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly
## 0 0
## CreditHistory.Delay Purpose.NewCar
## 0 0
## Purpose.UsedCar Purpose.Furniture.Equipment
## 0 0
## Purpose.Radio.Television Purpose.DomesticAppliance
## 0 0
## Purpose.Repairs Purpose.Education
## 0 0
## Purpose.Retraining Purpose.Business
## 0 0
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 0 0
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 0 0
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4
## 0 0
## EmploymentDuration.4.to.7 EmploymentDuration.gt.7
## 0 0
## Personal.Male.Divorced.Seperated Personal.Female.NotSingle
## 0 0
## Personal.Male.Single OtherDebtorsGuarantors.None
## 0 0
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate
## 0 0
## Property.Insurance Property.CarOther
## 0 0
## OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 0 0
## Housing.Rent Housing.Own
## 0 0
## Job.UnemployedUnskilled Job.UnskilledResident
## 0 0
## Job.SkilledEmployee
## 0
Your observation: There are no missing values in the data.
Next we split the data, setting the seed to 2023 for
reproducibility.
set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.60)
training_data = GermanCredit[index,]
testing_data = GermanCredit[-index,]
Your observation: We built a training dataset with 600 obs and a testing dataset with 400 obs.
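A quick sanity check (a sketch, not part of the original workflow) is to confirm that the Good/Bad mix is similar in the two splits:
prop.table(table(training_data$Class))  # share of 0/1 in the training set
prop.table(table(testing_data$Class))   # share of 0/1 in the testing set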
We will use the function rpart from package
rpart.
library(rpart)
library(rpart.plot)
# fit the model
class_tree <- rpart(as.factor(Class) ~ ., data=training_data)
rpart.plot(class_tree,extra=4, yesno=2)
Your observation: We have built a classification tree model to
predict the variable Class. It has one root node, several
decision nodes, and 14 leaf nodes. For example, the root node,
split on CheckingAccountStatus, is labeled 1 with proportions .29 and
.71: the predicted class there is 1 (=Good), with 71% of its
observations actually 1 and 29% actually 0. The next node, split on
ForeignWorker, predicts 0 with proportions 0.54 and 0.46. And so
on.
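As an optional follow-up (a sketch, not part of the required analysis), rpart's complexity-parameter table shows how the cross-validated error changes with tree size, which is useful if we later want to prune:
printcp(class_tree)   # cross-validated error for each subtree size
plotcp(class_tree)    # the same information as a plot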
# Make predictions on the train data
pred_credit_train <- predict(class_tree, training_data, type="class")
# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = training_data$Class,
pred = pred_credit_train)
Cmatrix_train
## pred
## true 0 1
## 0 102 73
## 1 35 390
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.18
Your observation: From the confusion matrix we can see that 102 observations are TN, 73 are FP, 35 are FN and 390 are TP. The misclassification rate is 0.18, which is fairly low for this data.
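Beyond the overall misclassification rate, the same matrix gives sensitivity and specificity. A minimal sketch, treating 1 (Good) as the positive class:
TP <- Cmatrix_train["1", "1"]; TN <- Cmatrix_train["0", "0"]
FP <- Cmatrix_train["0", "1"]; FN <- Cmatrix_train["1", "0"]
c(sensitivity = TP / (TP + FN),  # 390/425, about 0.92
  specificity = TN / (TN + FP))  # 102/175, about 0.58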
# Make predictions on the testing data
pred_credit_test <- predict(class_tree, testing_data, type="class")
# Confusion matrix to evaluate the model on testing data
Cmatrix_test = table(true = testing_data$Class,
pred = pred_credit_test)
Cmatrix_test
## pred
## true 0 1
## 0 50 75
## 1 48 227
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.3075
Your observation: From the confusion matrix we can see that 50 observations are TN, 75 are FP, 48 are FN and 227 are TP. The misclassification rate is 0.3075, noticeably higher than on the training set, which suggests the tree overfits the training data.
# We need to define a cost matrix first; keep the 0s on the diagonal.
# Rows are the true class (0, 1), columns are the predicted class (0, 1).
cost_matrix <- matrix(c(0, 2, # cost of 2 for a FP (true 0 predicted as 1)
1, 0), # cost of 1 for a FN (true 1 predicted as 0)
byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(Class) ~ ., data=training_data,
parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)
Your observation: The constructed tree model is now bigger. Most of the leaf nodes now predict 1 (=Good), whereas before the split was roughly even. The root node on CheckingAccountStatus, for example, is again predicted as 1, with about 0.71 of its observations being 1 and 0.29 being 0. And so on.
#get predictions for training
pred_credit_train <- predict(fit_tree_asym, training_data,
type = "class")
#C matrix for training with asymmetric costs
Cmatrix_train_asym = table( true = training_data$Class, pred = pred_credit_train)
Cmatrix_train_asym
## pred
## true 0 1
## 0 120 55
## 1 58 367
1 - sum(diag(Cmatrix_train_asym))/sum(Cmatrix_train_asym)
## [1] 0.1883333
Your observation: From the confusion matrix using the weighted class costs on the training data we can see that 120 observations are TN, 55 are FP, 58 are FN and 367 are TP. The misclassification rate is 0.188, which is still low, but slightly higher than the 0.18 we obtained on the training set without weighted costs.
# obtain predicted probability
pred_prob_train = predict(class_tree, training_data, type = "prob")
# This is necessary, as predict() for a tree model returns two columns, one for 0 and one for 1.
# Replace "1" with the actual category if the response variable is a factor
pred_prob_train = pred_prob_train[,"1"]
# Looks familiar, right?
library(ROCR)
pred <- prediction(pred_prob_train, training_data$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
AUC = unlist(slot(performance(pred, "auc"), "y.values"))
AUC
## [1] 0.7995227
Your observation: We get an AUC of 0.7995 on the training set using the predicted probabilities, which is reasonably high.
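For comparison, a sketch following the same ROCR recipe computes the AUC of the cost-sensitive tree on the training data:
pred_prob_asym <- predict(fit_tree_asym, training_data, type = "prob")[, "1"]
pred_asym <- prediction(pred_prob_asym, training_data$Class)
unlist(slot(performance(pred_asym, "auc"), "y.values"))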
#get predictions for testing
pred_credit_test <- predict(fit_tree_asym, testing_data, type = "class")
#C matrix for testing with the asymmetric costs
Cmatrix_test_weight = table( true = testing_data$Class, pred = pred_credit_test)
Cmatrix_test_weight
## pred
## true 0 1
## 0 67 58
## 1 57 218
1 - sum(diag(Cmatrix_test_weight))/sum(Cmatrix_test_weight)
## [1] 0.2875
Your observation: From the confusion matrix using the weighted class costs on the testing data we can see that 67 observations are TN, 58 are FP, 57 are FN and 218 are TP. The misclassification rate is 0.2875, somewhat lower than the 0.3075 we obtained on the testing set without weighted costs.
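Since the goal of the asymmetric tree is to reduce cost rather than raw error, a natural extra check (a sketch, not from the original analysis) is the average cost per observation under the same cost matrix; the table and cost_matrix share the true-by-predicted layout, so they multiply element-wise:
sum(Cmatrix_test_weight * cost_matrix) / sum(Cmatrix_test_weight)
# (0*67 + 2*58 + 1*57 + 0*218) / 400 = 173/400 = 0.4325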
# obtain predicted probability
pred_prob_test = predict(class_tree, testing_data, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for 0 and one for 1.
pred_prob_test = pred_prob_test[,"1"] # replace "1" with the actual category if the response variable is a factor
#ROC
pred <- prediction(pred_prob_test, testing_data$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6943855
Your observation: We get an AUC of 0.6944 on the testing set using the predicted probabilities. This is still reasonably high, but lower than the 0.7995 on the training set, again pointing to some overfitting.
We have carried out a classification analysis to predict whether Class is Good (1) or Bad (0) from the other variables. We fit two classification trees: one with equal costs and one with asymmetric costs of 2 for a FP and 1 for a FN. The cost-sensitive tree had 10 leaf nodes predicting 1 and 4 predicting 0, while the equal-cost tree was evenly split with 6 on each side. We split the data into training and testing sets, computed the misclassification rate (MR) for every combination, and computed ROC curves and AUC for the equal-cost tree on both sets. The best MR was on the training set with the equal-cost tree (0.18), followed by the training set with the cost-sensitive tree (0.1883), then the testing set with the cost-sensitive tree (0.2875), and finally the testing set with the equal-cost tree (0.3075). So the equal-cost tree did best on the training data, while the cost-sensitive tree did best on the testing data. The AUC was likewise higher on the training set (0.7995) than on the testing set (0.6944). Overall, all models performed fairly well, with small MRs and reasonably high AUCs.
First, all of the above-mentioned models can be used for classification: logistic regression, classification trees and SVM all predict binary variables, just in different ways. Decision trees build a tree of splitting rules to predict the response; logistic regression models it through coefficients and odds ratios and shows which variables are significant; SVM searches for a separating hyperplane between the two classes, choosing the maximum-margin classifier so as to make fewer errors.
If you are looking for interpretability, a decision tree might be a good choice: trees are easy to understand and interpret, easy to implement, and able to use different data types. Logistic regression is also easily interpretable; it provides coefficients that indicate the strength and direction of the relationship between each independent variable and the target, and from the model summary we can already see which variables are significant. It also yields odds ratios: for every one-unit change in an independent variable, the odds ratio tells us how much the odds of the dependent variable being 1 change (SVM and trees do not offer that). An SVM model is harder to interpret; we mainly assess it through the ROC-AUC, the confusion matrix and the MR. Decision trees are interpreted with those same tools, and logistic regression can additionally be evaluated through its deviance and the confusion matrix.
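To illustrate the odds-ratio interpretation, a hedged sketch fitting a logistic regression on the same training data (not part of the tree analysis above):
logit_fit <- glm(Class ~ ., data = training_data, family = binomial)
exp(coef(logit_fit)["Duration"])  # odds ratio for a one-month increase in Duration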
Decision trees and SVM can handle non-linearity in the data, whereas plain logistic regression cannot. If you need a versatile algorithm that can handle both linear and non-linear relationships in high-dimensional spaces, SVM might be a good option; of the three, it is the one best suited to high-dimensional data.
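A minimal SVM sketch on the same data, assuming the e1071 package is installed (e1071 is our assumption here; any SVM implementation would do):
library(e1071)
svm_fit <- svm(as.factor(Class) ~ ., data = training_data, kernel = "radial")
svm_pred <- predict(svm_fit, testing_data)
table(true = testing_data$Class, pred = svm_pred)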
All three models are prone to overfitting; decision trees especially so unless they are pruned.
Decision trees can handle both numerical and categorical data without extensive pre-processing. Logistic regression requires numerical input features, so categorical variables need to be transformed through techniques like one-hot encoding. Like logistic regression, SVM may need categorical variables encoded.
All of the models can use class weights (or asymmetric costs) to improve performance, as sketched below.
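A hedged sketch of how the other two models accept weights, under the assumption that we again want errors on true Bad (0) cases to count double, mirroring the cost matrix: glm() takes per-observation prior weights, and e1071::svm() (assuming e1071 from the earlier sketch) takes class.weights.
w <- ifelse(training_data$Class == 0, 2, 1)  # up-weight the Bad class
logit_w <- glm(Class ~ ., data = training_data, family = binomial, weights = w)
svm_w <- svm(as.factor(Class) ~ ., data = training_data,
             class.weights = c("0" = 2, "1" = 1))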