Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for variable descriptions. The response variable is Class and all others are predictors.

Run the following code to install the package caret if it is not already installed. The German credit scoring data is provided in that package.

if (!require("caret", quietly = TRUE)) {
    install.packages("caret")
}

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) # this package provides the German credit data in numeric (dummy-coded) format
data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: The GermanCredit dataset contains 1,000 observations and 62 variables. There are no missing values.

# Optional: drop the variables that provide no information in the data
#GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.
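
The optional line above removes columns by position, so a second run would act on the already shortened data frame and remove different columns. One way to make such a step safe to repeat is to drop columns by name instead. A minimal sketch on a toy data frame (the helper `drop_by_name` and the toy columns are illustrative assumptions, not part of the starter code):

```r
# Toy stand-in for GermanCredit with one uninformative column
df <- data.frame(
  Amount          = c(1169, 5951, 2096),
  Housing.Rent    = c(0, 1, 0),
  Housing.Own     = c(1, 0, 1),
  Housing.ForFree = c(0, 0, 0)
)

# Hypothetical helper: keep every column whose name is not listed in `cols`
drop_by_name <- function(data, cols) {
  data[, !(names(data) %in% cols), drop = FALSE]
}

drop_cols <- c("Housing.ForFree")
once  <- drop_by_name(df, drop_cols)
twice <- drop_by_name(once, drop_cols)  # a second run changes nothing

identical(once, twice)
```

Dropping by name is idempotent: once the listed columns are gone, repeating the call leaves the data frame unchanged.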

2. Explore the dataset to understand its structure. (10pts)

summary(GermanCredit)

##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker     Class         CheckingAccountStatus.lt.0
##  Min.   :0.000   Mode :logical   Min.   :0.000             
##  1st Qu.:1.000   FALSE:300       1st Qu.:0.000             
##  Median :1.000   TRUE :700       Median :0.000             
##  Mean   :0.963                   Mean   :0.274             
##  3rd Qu.:1.000                   3rd Qu.:1.000             
##  Max.   :1.000                   Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CheckingAccountStatus.none CreditHistory.NoCredit.AllPaid
##  Min.   :0.000              Min.   :0.00                  
##  1st Qu.:0.000              1st Qu.:0.00                  
##  Median :0.000              Median :0.00                  
##  Mean   :0.394              Mean   :0.04                  
##  3rd Qu.:1.000              3rd Qu.:0.00                  
##  Max.   :1.000              Max.   :1.00                  
##  CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly CreditHistory.Delay
##  Min.   :0.000                  Min.   :0.00           Min.   :0.000      
##  1st Qu.:0.000                  1st Qu.:0.00           1st Qu.:0.000      
##  Median :0.000                  Median :1.00           Median :0.000      
##  Mean   :0.049                  Mean   :0.53           Mean   :0.088      
##  3rd Qu.:0.000                  3rd Qu.:1.00           3rd Qu.:0.000      
##  Max.   :1.000                  Max.   :1.00           Max.   :1.000      
##  CreditHistory.Critical Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.000          Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000          1st Qu.:0.000   1st Qu.:0.000  
##  Median :0.000          Median :0.000   Median :0.000  
##  Mean   :0.293          Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.000          3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.000          Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Vacation Purpose.Retraining
##  Min.   :0.000   Min.   :0.00      Min.   :0        Min.   :0.000     
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0        1st Qu.:0.000     
##  Median :0.000   Median :0.00      Median :0        Median :0.000     
##  Mean   :0.022   Mean   :0.05      Mean   :0        Mean   :0.009     
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0        3rd Qu.:0.000     
##  Max.   :1.000   Max.   :1.00      Max.   :0        Max.   :1.000     
##  Purpose.Business Purpose.Other   SavingsAccountBonds.lt.100
##  Min.   :0.000    Min.   :0.000   Min.   :0.000             
##  1st Qu.:0.000    1st Qu.:0.000   1st Qu.:0.000             
##  Median :0.000    Median :0.000   Median :1.000             
##  Mean   :0.097    Mean   :0.012   Mean   :0.603             
##  3rd Qu.:0.000    3rd Qu.:0.000   3rd Qu.:1.000             
##  Max.   :1.000    Max.   :1.000   Max.   :1.000             
##  SavingsAccountBonds.100.to.500 SavingsAccountBonds.500.to.1000
##  Min.   :0.000                  Min.   :0.000                  
##  1st Qu.:0.000                  1st Qu.:0.000                  
##  Median :0.000                  Median :0.000                  
##  Mean   :0.103                  Mean   :0.063                  
##  3rd Qu.:0.000                  3rd Qu.:0.000                  
##  Max.   :1.000                  Max.   :1.000                  
##  SavingsAccountBonds.gt.1000 SavingsAccountBonds.Unknown
##  Min.   :0.000               Min.   :0.000              
##  1st Qu.:0.000               1st Qu.:0.000              
##  Median :0.000               Median :0.000              
##  Mean   :0.048               Mean   :0.183              
##  3rd Qu.:0.000               3rd Qu.:0.000              
##  Max.   :1.000               Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 EmploymentDuration.Unemployed
##  Min.   :0.000           Min.   :0.000                
##  1st Qu.:0.000           1st Qu.:0.000                
##  Median :0.000           Median :0.000                
##  Mean   :0.253           Mean   :0.062                
##  3rd Qu.:1.000           3rd Qu.:0.000                
##  Max.   :1.000           Max.   :1.000                
##  Personal.Male.Divorced.Seperated Personal.Female.NotSingle
##  Min.   :0.00                     Min.   :0.00             
##  1st Qu.:0.00                     1st Qu.:0.00             
##  Median :0.00                     Median :0.00             
##  Mean   :0.05                     Mean   :0.31             
##  3rd Qu.:0.00                     3rd Qu.:1.00             
##  Max.   :1.00                     Max.   :1.00             
##  Personal.Male.Single Personal.Male.Married.Widowed Personal.Female.Single
##  Min.   :0.000        Min.   :0.000                 Min.   :0             
##  1st Qu.:0.000        1st Qu.:0.000                 1st Qu.:0             
##  Median :1.000        Median :0.000                 Median :0             
##  Mean   :0.548        Mean   :0.092                 Mean   :0             
##  3rd Qu.:1.000        3rd Qu.:0.000                 3rd Qu.:0             
##  Max.   :1.000        Max.   :1.000                 Max.   :0             
##  OtherDebtorsGuarantors.None OtherDebtorsGuarantors.CoApplicant
##  Min.   :0.000               Min.   :0.000                     
##  1st Qu.:1.000               1st Qu.:0.000                     
##  Median :1.000               Median :0.000                     
##  Mean   :0.907               Mean   :0.041                     
##  3rd Qu.:1.000               3rd Qu.:0.000                     
##  Max.   :1.000               Max.   :1.000                     
##  OtherDebtorsGuarantors.Guarantor Property.RealEstate Property.Insurance
##  Min.   :0.000                    Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                    1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                    Median :0.000       Median :0.000     
##  Mean   :0.052                    Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                    3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                    Max.   :1.000       Max.   :1.000     
##  Property.CarOther Property.Unknown OtherInstallmentPlans.Bank
##  Min.   :0.000     Min.   :0.000    Min.   :0.000             
##  1st Qu.:0.000     1st Qu.:0.000    1st Qu.:0.000             
##  Median :0.000     Median :0.000    Median :0.000             
##  Mean   :0.332     Mean   :0.154    Mean   :0.139             
##  3rd Qu.:1.000     3rd Qu.:0.000    3rd Qu.:0.000             
##  Max.   :1.000     Max.   :1.000    Max.   :1.000             
##  OtherInstallmentPlans.Stores OtherInstallmentPlans.None  Housing.Rent  
##  Min.   :0.000                Min.   :0.000              Min.   :0.000  
##  1st Qu.:0.000                1st Qu.:1.000              1st Qu.:0.000  
##  Median :0.000                Median :1.000              Median :0.000  
##  Mean   :0.047                Mean   :0.814              Mean   :0.179  
##  3rd Qu.:0.000                3rd Qu.:1.000              3rd Qu.:0.000  
##  Max.   :1.000                Max.   :1.000              Max.   :1.000  
##   Housing.Own    Housing.ForFree Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :1.000   Median :0.000   Median :0.000           Median :0.0          
##  Mean   :0.713   Mean   :0.108   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee Job.Management.SelfEmp.HighlyQualified
##  Min.   :0.00        Min.   :0.000                         
##  1st Qu.:0.00        1st Qu.:0.000                         
##  Median :1.00        Median :0.000                         
##  Mean   :0.63        Mean   :0.148                         
##  3rd Qu.:1.00        3rd Qu.:0.000                         
##  Max.   :1.00        Max.   :1.000

dim(GermanCredit)

## [1] 1000   62

colSums(is.na(GermanCredit))

##                               Duration                                 Amount 
##                                      0                                      0 
##              InstallmentRatePercentage                      ResidenceDuration 
##                                      0                                      0 
##                                    Age                  NumberExistingCredits 
##                                      0                                      0 
##                NumberPeopleMaintenance                              Telephone 
##                                      0                                      0 
##                          ForeignWorker                                  Class 
##                                      0                                      0 
##             CheckingAccountStatus.lt.0         CheckingAccountStatus.0.to.200 
##                                      0                                      0 
##           CheckingAccountStatus.gt.200             CheckingAccountStatus.none 
##                                      0                                      0 
##         CreditHistory.NoCredit.AllPaid         CreditHistory.ThisBank.AllPaid 
##                                      0                                      0 
##                 CreditHistory.PaidDuly                    CreditHistory.Delay 
##                                      0                                      0 
##                 CreditHistory.Critical                         Purpose.NewCar 
##                                      0                                      0 
##                        Purpose.UsedCar            Purpose.Furniture.Equipment 
##                                      0                                      0 
##               Purpose.Radio.Television              Purpose.DomesticAppliance 
##                                      0                                      0 
##                        Purpose.Repairs                      Purpose.Education 
##                                      0                                      0 
##                       Purpose.Vacation                     Purpose.Retraining 
##                                      0                                      0 
##                       Purpose.Business                          Purpose.Other 
##                                      0                                      0 
##             SavingsAccountBonds.lt.100         SavingsAccountBonds.100.to.500 
##                                      0                                      0 
##        SavingsAccountBonds.500.to.1000            SavingsAccountBonds.gt.1000 
##                                      0                                      0 
##            SavingsAccountBonds.Unknown                EmploymentDuration.lt.1 
##                                      0                                      0 
##              EmploymentDuration.1.to.4              EmploymentDuration.4.to.7 
##                                      0                                      0 
##                EmploymentDuration.gt.7          EmploymentDuration.Unemployed 
##                                      0                                      0 
##       Personal.Male.Divorced.Seperated              Personal.Female.NotSingle 
##                                      0                                      0 
##                   Personal.Male.Single          Personal.Male.Married.Widowed 
##                                      0                                      0 
##                 Personal.Female.Single            OtherDebtorsGuarantors.None 
##                                      0                                      0 
##     OtherDebtorsGuarantors.CoApplicant       OtherDebtorsGuarantors.Guarantor 
##                                      0                                      0 
##                    Property.RealEstate                     Property.Insurance 
##                                      0                                      0 
##                      Property.CarOther                       Property.Unknown 
##                                      0                                      0 
##             OtherInstallmentPlans.Bank           OtherInstallmentPlans.Stores 
##                                      0                                      0 
##             OtherInstallmentPlans.None                           Housing.Rent 
##                                      0                                      0 
##                            Housing.Own                        Housing.ForFree 
##                                      0                                      0 
##                Job.UnemployedUnskilled                  Job.UnskilledResident 
##                                      0                                      0 
##                    Job.SkilledEmployee Job.Management.SelfEmp.HighlyQualified 
##                                      0                                      0

table(GermanCredit$Class)

## 
## FALSE  TRUE 
##   300   700

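Your observation: The response variable Class (converted to logical) is imbalanced, with 700 “Good” (TRUE) and 300 “Bad” (FALSE) customers, a 70/30 split. Apart from the seven integer variables (Duration through NumberPeopleMaintenance), the predictors are numeric 0/1 dummy variables created from the original categorical features. Purpose.Vacation and Personal.Female.Single are 0 for every observation.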
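
The class shares can also be computed directly rather than read off the counts. A minimal sketch, using a stand-in logical vector with the same counts as `GermanCredit$Class` (the name `cls` is an assumption for illustration):

```r
# Stand-in for GermanCredit$Class: 300 FALSE ("Bad") and 700 TRUE ("Good")
cls <- rep(c(FALSE, TRUE), times = c(300, 700))

counts <- table(cls)          # absolute frequencies
props  <- prop.table(counts)  # relative frequencies, summing to 1

round(props, 3)
```

With the real data, the same result would come from `prop.table(table(GermanCredit$Class))`.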

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

set.seed(2024)
index <- sample(1:nrow(GermanCredit), nrow(GermanCredit) * 0.8) 
GermanCredit_train <- GermanCredit[index, ]
GermanCredit_test <- GermanCredit[-index, ]

dim(GermanCredit_train)

## [1] 800  62

dim(GermanCredit_test)

## [1] 200  62

table(GermanCredit_train$Class)

## 
## FALSE  TRUE 
##   229   571

table(GermanCredit_test$Class)

## 
## FALSE  TRUE 
##    71   129

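Your observation: Using set.seed(2024), the data was split into a training set of 800 observations and a test set of 200 observations. Because the split is a simple random sample rather than a stratified one, the class proportions are only roughly preserved: bad customers make up 28.6% of the training set (229/800) and 35.5% of the test set (71/200), compared with 30% overall.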
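
The split above samples rows without regard to class. If matching class proportions in both sets were wanted, one option is to sample 80% within each class. The following is a minimal base-R sketch on a stand-in response vector, not the split the assignment asks for; the names `y`, `idx_by_class`, and `train_idx` are illustrative assumptions.

```r
set.seed(2024)

# Stand-in response with the same counts as GermanCredit$Class
y <- rep(c(TRUE, FALSE), times = c(700, 300))

# Row indices grouped by class, then an 80% sample drawn within each group
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(ix) {
  sample(ix, size = round(0.8 * length(ix)))
}))

table(y[train_idx])   # class counts in the stratified training set
table(y[-train_idx])  # class counts in the stratified test set
```

By construction, both sets keep the overall 70/30 ratio, at the cost of no longer being a simple random sample of rows.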

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

glm_credit <- glm(Class ~ ., family = binomial, data = GermanCredit_train)

Your observation: The logistic regression model was successfully fitted using the training data and all available predictors. Since the response variable is binary, logistic regression is appropriate for this problem. Using all predictors allows the model to capture multiple borrower characteristics that may influence whether a customer is classified as good or bad.
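
Because each categorical feature is represented by a complete set of dummy columns, fitting all of them together with an intercept makes one column per set redundant, and glm() reports its coefficient as NA. A minimal sketch of that behaviour on simulated data (the data frame `toy` and its columns are made up for illustration):

```r
set.seed(1)
n   <- 200
grp <- sample(c("a", "b", "c"), n, replace = TRUE)

# A full set of dummies: g.a + g.b + g.c equals 1 in every row
toy <- data.frame(
  y   = rbinom(n, 1, 0.6) == 1,
  g.a = as.numeric(grp == "a"),
  g.b = as.numeric(grp == "b"),
  g.c = as.numeric(grp == "c")
)

fit <- glm(y ~ ., family = binomial, data = toy)
coef(fit)  # the last dummy, g.c, is aliased and shown as NA
```

The aliased level acts as the reference category: the remaining dummy coefficients are differences in log-odds relative to it.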

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

summary(glm_credit)

## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = GermanCredit_train)
## 
## Coefficients: (13 not defined because of singularities)
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                             9.241e+00  1.719e+00   5.376 7.61e-08
## Duration                               -2.994e-02  1.072e-02  -2.794 0.005214
## Amount                                 -1.771e-04  5.095e-05  -3.475 0.000510
## InstallmentRatePercentage              -3.718e-01  1.036e-01  -3.589 0.000332
## ResidenceDuration                       2.577e-02  1.010e-01   0.255 0.798510
## Age                                     1.183e-02  1.097e-02   1.078 0.280974
## NumberExistingCredits                  -1.225e-01  2.189e-01  -0.560 0.575690
## NumberPeopleMaintenance                -1.731e-01  2.945e-01  -0.588 0.556678
## Telephone                              -4.236e-01  2.371e-01  -1.786 0.074081
## ForeignWorker                          -1.651e+00  7.421e-01  -2.224 0.026143
## CheckingAccountStatus.lt.0             -1.817e+00  2.710e-01  -6.703 2.04e-11
## CheckingAccountStatus.0.to.200         -1.432e+00  2.686e-01  -5.330 9.81e-08
## CheckingAccountStatus.gt.200           -5.912e-01  4.631e-01  -1.277 0.201696
## CheckingAccountStatus.none                     NA         NA      NA       NA
## CreditHistory.NoCredit.AllPaid         -8.724e-01  5.139e-01  -1.698 0.089584
## CreditHistory.ThisBank.AllPaid         -1.676e+00  5.493e-01  -3.052 0.002277
## CreditHistory.PaidDuly                 -6.686e-01  2.939e-01  -2.275 0.022899
## CreditHistory.Delay                    -9.413e-01  3.780e-01  -2.491 0.012756
## CreditHistory.Critical                         NA         NA      NA       NA
## Purpose.NewCar                         -1.733e+00  1.013e+00  -1.710 0.087282
## Purpose.UsedCar                         6.716e-02  1.033e+00   0.065 0.948146
## Purpose.Furniture.Equipment            -8.257e-01  1.015e+00  -0.814 0.415816
## Purpose.Radio.Television               -8.386e-01  1.019e+00  -0.823 0.410457
## Purpose.DomesticAppliance              -1.227e+00  1.328e+00  -0.923 0.355762
## Purpose.Repairs                        -1.321e+00  1.165e+00  -1.134 0.256825
## Purpose.Education                      -2.020e+00  1.088e+00  -1.857 0.063374
## Purpose.Vacation                               NA         NA      NA       NA
## Purpose.Retraining                      4.276e-01  1.640e+00   0.261 0.794237
## Purpose.Business                       -8.618e-01  1.032e+00  -0.835 0.403529
## Purpose.Other                                  NA         NA      NA       NA
## SavingsAccountBonds.lt.100             -1.266e+00  3.201e-01  -3.956 7.63e-05
## SavingsAccountBonds.100.to.500         -1.075e+00  4.171e-01  -2.577 0.009964
## SavingsAccountBonds.500.to.1000        -8.768e-01  5.216e-01  -1.681 0.092761
## SavingsAccountBonds.gt.1000             1.301e-02  6.161e-01   0.021 0.983157
## SavingsAccountBonds.Unknown                    NA         NA      NA       NA
## EmploymentDuration.lt.1                 3.581e-01  5.167e-01   0.693 0.488195
## EmploymentDuration.1.to.4               5.527e-01  5.000e-01   1.105 0.268967
## EmploymentDuration.4.to.7               9.863e-01  5.355e-01   1.842 0.065524
## EmploymentDuration.gt.7                 5.253e-01  5.039e-01   1.042 0.297218
## EmploymentDuration.Unemployed                  NA         NA      NA       NA
## Personal.Male.Divorced.Seperated       -2.546e-01  5.214e-01  -0.488 0.625274
## Personal.Female.NotSingle              -1.274e-01  3.573e-01  -0.357 0.721452
## Personal.Male.Single                    4.118e-01  3.623e-01   1.137 0.255622
## Personal.Male.Married.Widowed                  NA         NA      NA       NA
## Personal.Female.Single                         NA         NA      NA       NA
## OtherDebtorsGuarantors.None            -1.239e+00  5.370e-01  -2.308 0.021018
## OtherDebtorsGuarantors.CoApplicant     -1.565e+00  6.828e-01  -2.292 0.021919
## OtherDebtorsGuarantors.Guarantor               NA         NA      NA       NA
## Property.RealEstate                     7.166e-01  4.898e-01   1.463 0.143477
## Property.Insurance                      3.544e-01  4.785e-01   0.741 0.458926
## Property.CarOther                       6.110e-01  4.648e-01   1.314 0.188702
## Property.Unknown                               NA         NA      NA       NA
## OtherInstallmentPlans.Bank             -8.504e-01  2.730e-01  -3.115 0.001838
## OtherInstallmentPlans.Stores           -4.293e-01  4.711e-01  -0.911 0.362139
## OtherInstallmentPlans.None                     NA         NA      NA       NA
## Housing.Rent                           -9.538e-01  5.624e-01  -1.696 0.089924
## Housing.Own                            -2.723e-01  5.282e-01  -0.516 0.606157
## Housing.ForFree                                NA         NA      NA       NA
## Job.UnemployedUnskilled                 1.449e+00  8.788e-01   1.649 0.099175
## Job.UnskilledResident                  -2.641e-03  4.101e-01  -0.006 0.994861
## Job.SkilledEmployee                    -1.073e-02  3.349e-01  -0.032 0.974438
## Job.Management.SelfEmp.HighlyQualified         NA         NA      NA       NA
##                                           
## (Intercept)                            ***
## Duration                               ** 
## Amount                                 ***
## InstallmentRatePercentage              ***
## ResidenceDuration                         
## Age                                       
## NumberExistingCredits                     
## NumberPeopleMaintenance                   
## Telephone                              .  
## ForeignWorker                          *  
## CheckingAccountStatus.lt.0             ***
## CheckingAccountStatus.0.to.200         ***
## CheckingAccountStatus.gt.200              
## CheckingAccountStatus.none                
## CreditHistory.NoCredit.AllPaid         .  
## CreditHistory.ThisBank.AllPaid         ** 
## CreditHistory.PaidDuly                 *  
## CreditHistory.Delay                    *  
## CreditHistory.Critical                    
## Purpose.NewCar                         .  
## Purpose.UsedCar                           
## Purpose.Furniture.Equipment               
## Purpose.Radio.Television                  
## Purpose.DomesticAppliance                 
## Purpose.Repairs                           
## Purpose.Education                      .  
## Purpose.Vacation                          
## Purpose.Retraining                        
## Purpose.Business                          
## Purpose.Other                             
## SavingsAccountBonds.lt.100             ***
## SavingsAccountBonds.100.to.500         ** 
## SavingsAccountBonds.500.to.1000        .  
## SavingsAccountBonds.gt.1000               
## SavingsAccountBonds.Unknown               
## EmploymentDuration.lt.1                   
## EmploymentDuration.1.to.4                 
## EmploymentDuration.4.to.7              .  
## EmploymentDuration.gt.7                   
## EmploymentDuration.Unemployed             
## Personal.Male.Divorced.Seperated          
## Personal.Female.NotSingle                 
## Personal.Male.Single                      
## Personal.Male.Married.Widowed             
## Personal.Female.Single                    
## OtherDebtorsGuarantors.None            *  
## OtherDebtorsGuarantors.CoApplicant     *  
## OtherDebtorsGuarantors.Guarantor          
## Property.RealEstate                       
## Property.Insurance                        
## Property.CarOther                         
## Property.Unknown                          
## OtherInstallmentPlans.Bank             ** 
## OtherInstallmentPlans.Stores              
## OtherInstallmentPlans.None                
## Housing.Rent                           .  
## Housing.Own                               
## Housing.ForFree                           
## Job.UnemployedUnskilled                .  
## Job.UnskilledResident                     
## Job.SkilledEmployee                       
## Job.Management.SelfEmp.HighlyQualified    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 958.02  on 799  degrees of freedom
## Residual deviance: 672.78  on 751  degrees of freedom
## AIC: 770.78
## 
## Number of Fisher Scoring iterations: 5

Your observation: The model output shows that several predictors are statistically significant, meaning they help explain the probability that a customer is classified as good. In particular, variables related to checking account status, savings account status, and credit history appear important.

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

pred_prob_train <- predict(glm_credit, type = "response")

Your observation:
Many predicted probabilities are relatively high, which makes sense because the training set contains more good customers than bad customers (571 good versus 229 bad out of 800, about 71% good). This suggests the model reflects the class imbalance and assigns higher probabilities to the majority class.
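One way to make this observation concrete: for a logistic regression fitted with an intercept, the average fitted probability on the training data equals the observed share of positives. The sketch below illustrates this on simulated data; the objects `x`, `y`, `fit`, and `p_hat` are illustrative assumptions and are not part of the assignment.

```r
# Simulated data, for illustration only (not the GermanCredit data)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1 + 0.8 * x))  # majority of positives

# Logistic regression with an intercept
fit <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

# The mean fitted probability matches the observed share of positives
mean_fitted <- mean(p_hat)
base_rate <- mean(y)
print(c(mean_fitted = mean_fitted, base_rate = base_rate))
```

Applied to this homework, mean(pred_prob_train) would be expected to be close to 571/800, or about 0.714, which is why most predicted probabilities sit well above 0.5.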

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

costfunc <- function(obs, pred.p, pcut) {
  weight_FN <- 1
  weight_FP <- 1
  pred_class <- (pred.p >= pcut)
  FN <- (obs == TRUE) & (pred_class == FALSE)
  FP <- (obs == FALSE) & (pred_class == TRUE)
  cost <- mean(weight_FN * FN + weight_FP * FP)
  return(cost)
}

pcut.seq <- seq(0.01, 0.99, by = 0.01)
MR_vec <- rep(0, length(pcut.seq))

for (i in seq_along(pcut.seq)) {
  MR_vec[i] <- costfunc(obs = GermanCredit_train$Class, pred.p = pred_prob_train, pcut = pcut.seq[i])
}

optimal.pcut <- pcut.seq[which.min(MR_vec)]
print(paste("Optimal cut-off (equal weight):", round(optimal.pcut, 3)))

## [1] "Optimal cut-off (equal weight): 0.39"

Your observation: Using equal weights for false negatives and false positives, the optimal probability cut-off is 0.39. This cut-off is lower than the default value of 0.50, which means the model performs better when it is slightly more willing to classify customers as good. This result likely reflects the class distribution in the data and the trade-off needed to minimize the overall misclassification rate.
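The same grid search can also be written without an explicit loop. The sketch below is one possible variant, run on simulated labels and probabilities so that it is self-contained; `weighted_cost`, `cost_vec`, and `best_cut` are hypothetical names and are not part of the starter code.

```r
# Cost of a cut-off; weights are arguments rather than hard-coded
# (weighted_cost is a hypothetical helper, not part of the starter code)
weighted_cost <- function(obs, pred.p, pcut, w_fn = 1, w_fp = 1) {
  pred_class <- pred.p >= pcut
  fn <- obs & !pred_class    # good customer predicted bad
  fp <- !obs & pred_class    # bad customer predicted good
  mean(w_fn * fn + w_fp * fp)
}

# Simulated labels and probabilities, for illustration only
set.seed(2024)
obs <- runif(300) < 0.7
pred.p <- plogis(ifelse(obs, 1.2, -0.2) + rnorm(300))

# Grid search over cut-offs with sapply instead of a for loop
pcut.seq <- seq(0.01, 0.99, by = 0.01)
cost_vec <- sapply(pcut.seq, function(pc) weighted_cost(obs, pred.p, pc))

# which.min returns the first minimum when several cut-offs tie
best_cut <- pcut.seq[which.min(cost_vec)]
print(best_cut)
```

Passing the weights as arguments means the equal-weight and unequal-weight searches can share one function.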

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the training set.

pred_class_train <- (pred_prob_train >= optimal.pcut) * 1
conf_train <- table(GermanCredit_train$Class, pred_class_train, 
                    dnn = c("True", "Predicted"))
print(conf_train)

##        Predicted
## True      0   1
##   FALSE 106 123
##   TRUE   34 537

MR_train <- mean(GermanCredit_train$Class != (pred_prob_train >= optimal.pcut))
print(paste("Training MR:", round(MR_train, 4)))

## [1] "Training MR: 0.1962"

Your observation: Using the optimal cut-off of 0.39, the model correctly classified 106 bad customers and 537 good customers in the training set, while misclassifying 123 bad customers as good and 34 good customers as bad. The training misclassification rate is 0.1962, meaning the model makes errors on about 19.62% of the training observations. This indicates reasonably good performance on the training data.
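The figures in this observation can be recomputed directly from the four counts in the confusion matrix. The sketch below does that arithmetic; the object names (`cm`, `mr`, `tpr`, `tnr`) are illustrative.

```r
# Counts copied from the training confusion matrix above
# (rows = true class, columns = predicted class; filled column by column)
cm <- matrix(c(106, 34, 123, 537), nrow = 2,
             dimnames = list(True = c("FALSE", "TRUE"),
                             Predicted = c("0", "1")))

n   <- sum(cm)                                    # 800 observations
mr  <- (cm["FALSE", "1"] + cm["TRUE", "0"]) / n   # misclassification rate
tpr <- cm["TRUE", "1"] / sum(cm["TRUE", ])        # share of good customers predicted good
tnr <- cm["FALSE", "0"] / sum(cm["FALSE", ])      # share of bad customers predicted bad
print(round(c(MR = mr, TPR = tpr, TNR = tnr), 4))
```

The breakdown shows that the errors are uneven: about 94% of good customers are classified correctly, but only about 46% of bad customers are.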

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

library(ROCR)

pred_train <- prediction(pred_prob_train, GermanCredit_train$Class)
ROC_train <- performance(pred_train, "tpr", "fpr")
plot(ROC_train, colorize = TRUE, main = "ROC Curve - Training")

auc_train <- performance(pred_train, "auc")
auc_train <- unlist(slot(auc_train, "y.values"))
print(paste("Training AUC:", round(auc_train, 4)))

## [1] "Training AUC: 0.8505"

Your observation: The ROC curve for the training set shows that the model has good discriminatory ability. The training AUC is 0.8505, which is well above 0.50 and indicates that the model does a strong job of separating good customers from bad customers. In other words, the model ranks observations fairly well even before applying a cut-off.
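The ranking interpretation can be checked with a few lines of base R: AUC equals the probability that a randomly chosen good customer receives a higher score than a randomly chosen bad one. The sketch below uses a tiny made-up example; `auc_rank` is a hypothetical helper and not an ROCR function.

```r
# AUC as a rank statistic: the share of (positive, negative) pairs in which
# the positive has the higher score (ties count one half).
# auc_rank is a hypothetical helper, not an ROCR function.
auc_rank <- function(score, label) {
  r <- rank(score)                 # average ranks handle ties
  n_pos <- sum(label)
  n_neg <- sum(!label)
  (sum(r[label]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny hand-checkable example: 3 positives, 2 negatives, 6 pairs,
# of which 5 are ordered correctly
score <- c(0.9, 0.8, 0.7, 0.4, 0.3)
label <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
auc_small <- auc_rank(score, label)
print(auc_small)   # 5/6
```

Because only the ordering of the scores enters the calculation, the AUC does not depend on which cut-off is chosen.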

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

pred_prob_test <- predict(glm_credit, newdata = GermanCredit_test, type = "response")
pred_class_test <- (pred_prob_test >= optimal.pcut) * 1

conf_test <- table(GermanCredit_test$Class, pred_class_test, 
                   dnn = c("True", "Predicted"))
print(conf_test)

##        Predicted
## True      0   1
##   FALSE  28  43
##   TRUE   12 117

MR_test <- mean(GermanCredit_test$Class != (pred_prob_test >= optimal.pcut))
print(paste("Test MR:", round(MR_test, 4)))

## [1] "Test MR: 0.275"

Your observation: Using the same cut-off of 0.39 on the test set, the model correctly classified 28 bad customers and 117 good customers, while misclassifying 43 bad customers as good and 12 good customers as bad. The test misclassification rate is 0.2750, meaning about 27.5% of test observations were classified incorrectly. Since this error rate is higher than the training MR, the model performs worse on new data than on the training set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

pred_test <- prediction(pred_prob_test, GermanCredit_test$Class)
ROC_test <- performance(pred_test, "tpr", "fpr")
plot(ROC_test, colorize = TRUE, main = "ROC Curve - Test")

auc_test <- performance(pred_test, "auc")
auc_test <- unlist(slot(auc_test, "y.values"))
print(paste("Test AUC:", round(auc_test, 4)))

## [1] "Test AUC: 0.7353"

Your observation: The test AUC is 0.7353, which is lower than the training AUC of 0.8505. This means the model still has acceptable predictive ability on unseen data, but its performance drops when applied to the test set.

Task 5: Using different weights (20pts)

Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!

1. Obtain optimal probability cut-off point again, with the new weights.

costfunc_weighted <- function(obs, pred.p, pcut) {
  weight_FN <- 5   # cost of classing a good customer (Class == TRUE) as bad
  weight_FP <- 1   # cost of classing a bad customer (Class == FALSE) as good
  pred_class <- (pred.p >= pcut)
  FN <- (obs == TRUE) & (pred_class == FALSE)
  FP <- (obs == FALSE) & (pred_class == TRUE)
  cost <- mean(weight_FN * FN + weight_FP * FP)
  return(cost)
}

MR_vec_w <- rep(0, length(pcut.seq))
for (i in seq_along(pcut.seq)) {
  MR_vec_w[i] <- costfunc_weighted(obs = GermanCredit_train$Class, 
                                   pred.p = pred_prob_train, 
                                   pcut = pcut.seq[i])
}

optimal.pcut_w <- pcut.seq[which.min(MR_vec_w)]
print(paste("Optimal cut-off (weighted 5:1):", round(optimal.pcut_w, 3)))

## [1] "Optimal cut-off (weighted 5:1): 0.22"

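Your observation: When false negatives (good customers classified as bad) are given a higher cost than false positives, the optimal cut-off decreases to 0.22. This lower threshold makes the model more likely to predict a customer as good, which reduces the number of good customers rejected. Because Class is TRUE for good customers, however, the error described in the task (classing a bad customer as good) is a false positive; assigning the weight of 5 to weight_FP instead would be expected to raise the cut-off rather than lower it. Either way, the change in cut-off shows how business priorities can directly affect classification decisions.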
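The direction in which the weights move the cut-off can be reasoned out from expected costs. The sketch below assumes the predicted probabilities are well calibrated and that TRUE means a good customer, as in this homework; `theoretical_cutoff` is a hypothetical helper and not part of the starter code.

```r
# Cost-minimizing cut-off when predicted probabilities are well calibrated.
# For a customer with P(good) = p:
#   expected cost of predicting good = (1 - p) * w_fp  (the customer may be bad)
#   expected cost of predicting bad  = p * w_fn        (the customer may be good)
# Predicting good is the cheaper choice when p >= w_fp / (w_fn + w_fp).
# theoretical_cutoff is a hypothetical helper, not part of the starter code.
theoretical_cutoff <- function(w_fn, w_fp) w_fp / (w_fn + w_fp)

cut_equal <- theoretical_cutoff(w_fn = 1, w_fp = 1)  # 0.5
cut_fn5   <- theoretical_cutoff(w_fn = 5, w_fp = 1)  # 1/6: good classed as bad costs 5
cut_fp5   <- theoretical_cutoff(w_fn = 1, w_fp = 5)  # 5/6: bad classed as good costs 5
print(round(c(equal = cut_equal, fn5 = cut_fn5, fp5 = cut_fp5), 3))
```

The grid search on a finite training sample need not reproduce these values (it gave 0.39 and 0.22 here), but the direction agrees: a heavier weight on false negatives lowers the cut-off, while a heavier weight on false positives would raise it.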

2. Obtain the confusion matrix and MR for the training set.

pred_class_train_w <- (pred_prob_train >= optimal.pcut_w) * 1
conf_train_w <- table(GermanCredit_train$Class, pred_class_train_w, 
                      dnn = c("True", "Predicted"))
print(conf_train_w)

##        Predicted
## True      0   1
##   FALSE  41 188
##   TRUE    4 567

weighted_MR_train <- costfunc_weighted(GermanCredit_train$Class, pred_prob_train, optimal.pcut_w)
print(paste("Weighted Training Cost:", round(weighted_MR_train, 4)))

## [1] "Weighted Training Cost: 0.26"

Your observation: Using the weighted cut-off of 0.22 on the training set, the model correctly classified 41 bad customers and 567 good customers, while misclassifying 188 bad customers as good and only 4 good customers as bad. The weighted training cost is (5 * 4 + 188)/800 = 0.26, and the unweighted MR at this cut-off is (188 + 4)/800 = 0.24.

3. Obtain the confusion matrix and MR for the test set.

pred_class_test_w <- (pred_prob_test >= optimal.pcut_w) * 1
conf_test_w <- table(GermanCredit_test$Class, pred_class_test_w, 
                     dnn = c("True", "Predicted"))
print(conf_test_w)

##        Predicted
## True      0   1
##   FALSE  17  54
##   TRUE    5 124

weighted_MR_test <- costfunc_weighted(GermanCredit_test$Class, pred_prob_test, optimal.pcut_w)
print(paste("Weighted Test Cost:", round(weighted_MR_test, 4)))

## [1] "Weighted Test Cost: 0.395"

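Your observation: On the test set, the weighted cut-off of 0.22 produces a weighted cost of (5 * 5 + 54)/200 = 0.395, which is higher than the weighted training cost of 0.26. The confusion matrix shows that the model correctly classified 17 bad customers and 124 good customers, but misclassified 54 bad customers as good and 5 good customers as bad, so the unweighted MR at this cut-off is (54 + 5)/200 = 0.295.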
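Both the weighted cost and the plain MR follow from the confusion-matrix counts alone. The sketch below recomputes them for the training and test sets at the 0.22 cut-off; `cost_from_counts` is a hypothetical helper, with defaults that mirror the weights used in costfunc_weighted.

```r
# Weighted cost and plain MR from confusion-matrix counts.
# cost_from_counts is a hypothetical helper; fn = good predicted bad,
# fp = bad predicted good, n = number of observations.
cost_from_counts <- function(fn, fp, n, w_fn = 5, w_fp = 1) {
  c(weighted_cost = (w_fn * fn + w_fp * fp) / n,
    MR = (fn + fp) / n)
}

# Counts taken from the two confusion matrices at cut-off 0.22
train_w <- cost_from_counts(fn = 4, fp = 188, n = 800)
test_w  <- cost_from_counts(fn = 5, fp = 54,  n = 200)
print(rbind(train = train_w, test = test_w))
```

The weighted costs match the printed values (0.26 and 0.395), and the MR column supplies the figure the task asks for (0.24 and 0.295).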

Task 6: Conclusion (10pts)

Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model further.

Overall, the logistic regression model performed reasonably well in predicting customer creditworthiness. Under equal weights, the optimal cut-off was 0.39, with a training MR of 0.1962, test MR of 0.2750, training AUC of 0.8505, and test AUC of 0.7353. These results show that the model has good predictive ability, although the gap between training and test performance (MR higher by about 0.08, AUC lower by about 0.12) suggests some overfitting. When a weight of 5 was placed on false negatives, the optimal cut-off dropped to 0.22, which nearly eliminated rejected good customers (4 in training, 5 in test) at the cost of accepting more bad customers (188 in training, 54 in test). To improve the model further, I would consider variable selection, checking multicollinearity, trying interaction terms, and comparing logistic regression with other classification methods.

Homework4

Blanca Kishi

04/03/2026
