Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format

## Loading required package: ggplot2

## Loading required package: lattice

data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: 62 variables

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.

2. Explore the dataset to understand its structure. (10pts)

str(GermanCredit)    # structure - see variable type

## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...

summary(GermanCredit) # summary statistics

##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker     Class         CheckingAccountStatus.lt.0
##  Min.   :0.000   Mode :logical   Min.   :0.000             
##  1st Qu.:1.000   FALSE:300       1st Qu.:0.000             
##  Median :1.000   TRUE :700       Median :0.000             
##  Mean   :0.963                   Mean   :0.274             
##  3rd Qu.:1.000                   3rd Qu.:1.000             
##  Max.   :1.000                   Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
##  Min.   :0.00                   Min.   :0.000                 
##  1st Qu.:0.00                   1st Qu.:0.000                 
##  Median :0.00                   Median :0.000                 
##  Mean   :0.04                   Mean   :0.049                 
##  3rd Qu.:0.00                   3rd Qu.:0.000                 
##  Max.   :1.00                   Max.   :1.000                 
##  CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.00           Min.   :0.000       Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00           1st Qu.:0.000       1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.00           Median :0.000       Median :0.000   Median :0.000  
##  Mean   :0.53           Mean   :0.088       Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.00           3rd Qu.:0.000       3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.00           Max.   :1.000       Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
##  Min.   :0.000   Min.   :0.00      Min.   :0.000      Min.   :0.000   
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0.000      1st Qu.:0.000   
##  Median :0.000   Median :0.00      Median :0.000      Median :0.000   
##  Mean   :0.022   Mean   :0.05      Mean   :0.009      Mean   :0.097   
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0.000      3rd Qu.:0.000   
##  Max.   :1.000   Max.   :1.00      Max.   :1.000      Max.   :1.000   
##  SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
##  Min.   :0.000              Min.   :0.000                 
##  1st Qu.:0.000              1st Qu.:0.000                 
##  Median :1.000              Median :0.000                 
##  Mean   :0.603              Mean   :0.103                 
##  3rd Qu.:1.000              3rd Qu.:0.000                 
##  Max.   :1.000              Max.   :1.000                 
##  SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
##  Min.   :0.000                   Min.   :0.000              
##  1st Qu.:0.000                   1st Qu.:0.000              
##  Median :0.000                   Median :0.000              
##  Mean   :0.063                   Mean   :0.048              
##  3rd Qu.:0.000                   3rd Qu.:0.000              
##  Max.   :1.000                   Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
##  Min.   :0.000           Min.   :0.00                    
##  1st Qu.:0.000           1st Qu.:0.00                    
##  Median :0.000           Median :0.00                    
##  Mean   :0.253           Mean   :0.05                    
##  3rd Qu.:1.000           3rd Qu.:0.00                    
##  Max.   :1.000           Max.   :1.00                    
##  Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
##  Min.   :0.00              Min.   :0.000        Min.   :0.000              
##  1st Qu.:0.00              1st Qu.:0.000        1st Qu.:1.000              
##  Median :0.00              Median :1.000        Median :1.000              
##  Mean   :0.31              Mean   :0.548        Mean   :0.907              
##  3rd Qu.:1.00              3rd Qu.:1.000        3rd Qu.:1.000              
##  Max.   :1.00              Max.   :1.000        Max.   :1.000              
##  OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
##  Min.   :0.000                      Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                      1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                      Median :0.000       Median :0.000     
##  Mean   :0.041                      Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                      3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                      Max.   :1.000       Max.   :1.000     
##  Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
##  Min.   :0.000     Min.   :0.000              Min.   :0.000               
##  1st Qu.:0.000     1st Qu.:0.000              1st Qu.:0.000               
##  Median :0.000     Median :0.000              Median :0.000               
##  Mean   :0.332     Mean   :0.139              Mean   :0.047               
##  3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:0.000               
##  Max.   :1.000     Max.   :1.000              Max.   :1.000               
##   Housing.Rent    Housing.Own    Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :0.000   Median :1.000   Median :0.000           Median :0.0          
##  Mean   :0.179   Mean   :0.713   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee
##  Min.   :0.00       
##  1st Qu.:0.00       
##  Median :1.00       
##  Mean   :0.63       
##  3rd Qu.:1.00       
##  Max.   :1.00

colSums(is.na(GermanCredit))

##                           Duration                             Amount 
##                                  0                                  0 
##          InstallmentRatePercentage                  ResidenceDuration 
##                                  0                                  0 
##                                Age              NumberExistingCredits 
##                                  0                                  0 
##            NumberPeopleMaintenance                          Telephone 
##                                  0                                  0 
##                      ForeignWorker                              Class 
##                                  0                                  0 
##         CheckingAccountStatus.lt.0     CheckingAccountStatus.0.to.200 
##                                  0                                  0 
##       CheckingAccountStatus.gt.200     CreditHistory.NoCredit.AllPaid 
##                                  0                                  0 
##     CreditHistory.ThisBank.AllPaid             CreditHistory.PaidDuly 
##                                  0                                  0 
##                CreditHistory.Delay                     Purpose.NewCar 
##                                  0                                  0 
##                    Purpose.UsedCar        Purpose.Furniture.Equipment 
##                                  0                                  0 
##           Purpose.Radio.Television          Purpose.DomesticAppliance 
##                                  0                                  0 
##                    Purpose.Repairs                  Purpose.Education 
##                                  0                                  0 
##                 Purpose.Retraining                   Purpose.Business 
##                                  0                                  0 
##         SavingsAccountBonds.lt.100     SavingsAccountBonds.100.to.500 
##                                  0                                  0 
##    SavingsAccountBonds.500.to.1000        SavingsAccountBonds.gt.1000 
##                                  0                                  0 
##            EmploymentDuration.lt.1          EmploymentDuration.1.to.4 
##                                  0                                  0 
##          EmploymentDuration.4.to.7            EmploymentDuration.gt.7 
##                                  0                                  0 
##   Personal.Male.Divorced.Seperated          Personal.Female.NotSingle 
##                                  0                                  0 
##               Personal.Male.Single        OtherDebtorsGuarantors.None 
##                                  0                                  0 
## OtherDebtorsGuarantors.CoApplicant                Property.RealEstate 
##                                  0                                  0 
##                 Property.Insurance                  Property.CarOther 
##                                  0                                  0 
##         OtherInstallmentPlans.Bank       OtherInstallmentPlans.Stores 
##                                  0                                  0 
##                       Housing.Rent                        Housing.Own 
##                                  0                                  0 
##            Job.UnemployedUnskilled              Job.UnskilledResident 
##                                  0                                  0 
##                Job.SkilledEmployee 
##                                  0

Your observation: after removing the variables with no data there are 49 variables with no missing data. id say there are too many variables with little information but we will see if they are significant once we conduct some tests.

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
credit_train = GermanCredit[index,]
credit_test = GermanCredit[-index,]

Your observation: split 80-20

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

glm_credit<- glm(Class~., family=binomial, data=credit_train)

Your observation:

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

summary(glm_credit)

## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = credit_train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         9.241e+00  1.719e+00   5.376 7.61e-08 ***
## Duration                           -2.994e-02  1.072e-02  -2.794 0.005214 ** 
## Amount                             -1.771e-04  5.095e-05  -3.475 0.000510 ***
## InstallmentRatePercentage          -3.718e-01  1.036e-01  -3.589 0.000332 ***
## ResidenceDuration                   2.577e-02  1.010e-01   0.255 0.798510    
## Age                                 1.183e-02  1.097e-02   1.078 0.280974    
## NumberExistingCredits              -1.225e-01  2.189e-01  -0.560 0.575690    
## NumberPeopleMaintenance            -1.731e-01  2.945e-01  -0.588 0.556678    
## Telephone                          -4.236e-01  2.371e-01  -1.786 0.074081 .  
## ForeignWorker                      -1.651e+00  7.421e-01  -2.224 0.026143 *  
## CheckingAccountStatus.lt.0         -1.817e+00  2.710e-01  -6.703 2.04e-11 ***
## CheckingAccountStatus.0.to.200     -1.432e+00  2.686e-01  -5.330 9.81e-08 ***
## CheckingAccountStatus.gt.200       -5.912e-01  4.631e-01  -1.277 0.201696    
## CreditHistory.NoCredit.AllPaid     -8.724e-01  5.139e-01  -1.698 0.089584 .  
## CreditHistory.ThisBank.AllPaid     -1.676e+00  5.493e-01  -3.052 0.002277 ** 
## CreditHistory.PaidDuly             -6.686e-01  2.939e-01  -2.275 0.022899 *  
## CreditHistory.Delay                -9.413e-01  3.780e-01  -2.491 0.012756 *  
## Purpose.NewCar                     -1.733e+00  1.013e+00  -1.710 0.087282 .  
## Purpose.UsedCar                     6.716e-02  1.033e+00   0.065 0.948146    
## Purpose.Furniture.Equipment        -8.257e-01  1.015e+00  -0.814 0.415816    
## Purpose.Radio.Television           -8.386e-01  1.019e+00  -0.823 0.410457    
## Purpose.DomesticAppliance          -1.227e+00  1.328e+00  -0.923 0.355762    
## Purpose.Repairs                    -1.321e+00  1.165e+00  -1.134 0.256825    
## Purpose.Education                  -2.020e+00  1.088e+00  -1.857 0.063374 .  
## Purpose.Retraining                  4.276e-01  1.640e+00   0.261 0.794237    
## Purpose.Business                   -8.618e-01  1.032e+00  -0.835 0.403529    
## SavingsAccountBonds.lt.100         -1.266e+00  3.201e-01  -3.956 7.63e-05 ***
## SavingsAccountBonds.100.to.500     -1.075e+00  4.171e-01  -2.577 0.009964 ** 
## SavingsAccountBonds.500.to.1000    -8.768e-01  5.216e-01  -1.681 0.092761 .  
## SavingsAccountBonds.gt.1000         1.301e-02  6.161e-01   0.021 0.983157    
## EmploymentDuration.lt.1             3.581e-01  5.167e-01   0.693 0.488195    
## EmploymentDuration.1.to.4           5.527e-01  5.000e-01   1.105 0.268967    
## EmploymentDuration.4.to.7           9.863e-01  5.355e-01   1.842 0.065524 .  
## EmploymentDuration.gt.7             5.253e-01  5.039e-01   1.042 0.297218    
## Personal.Male.Divorced.Seperated   -2.546e-01  5.214e-01  -0.488 0.625274    
## Personal.Female.NotSingle          -1.274e-01  3.573e-01  -0.357 0.721452    
## Personal.Male.Single                4.118e-01  3.623e-01   1.137 0.255622    
## OtherDebtorsGuarantors.None        -1.239e+00  5.370e-01  -2.308 0.021018 *  
## OtherDebtorsGuarantors.CoApplicant -1.565e+00  6.828e-01  -2.292 0.021919 *  
## Property.RealEstate                 7.166e-01  4.898e-01   1.463 0.143477    
## Property.Insurance                  3.544e-01  4.785e-01   0.741 0.458926    
## Property.CarOther                   6.110e-01  4.648e-01   1.314 0.188702    
## OtherInstallmentPlans.Bank         -8.504e-01  2.730e-01  -3.115 0.001838 ** 
## OtherInstallmentPlans.Stores       -4.293e-01  4.711e-01  -0.911 0.362139    
## Housing.Rent                       -9.538e-01  5.624e-01  -1.696 0.089924 .  
## Housing.Own                        -2.723e-01  5.282e-01  -0.516 0.606157    
## Job.UnemployedUnskilled             1.449e+00  8.788e-01   1.649 0.099175 .  
## Job.UnskilledResident              -2.641e-03  4.101e-01  -0.006 0.994861    
## Job.SkilledEmployee                -1.073e-02  3.349e-01  -0.032 0.974438    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 958.02  on 799  degrees of freedom
## Residual deviance: 672.78  on 751  degrees of freedom
## AIC: 770.78
## 
## Number of Fisher Scoring iterations: 5

Your observation: there seem to be many variables that are significant such as duration, amount, installment rate, and having low checking balance. these variables tend to increase the chance of becoming a credit risk

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

train_probs <- predict(glm_credit, newdata = credit_train, type = "response")

Your observation:

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

costfunc <- function(obs, pred.p, pcut) {
  weight_FN = 1    # Define the weight for "true=1 but pred=0" (FN)
  weight_FP = 1    # Define the weight for "true=0 but pred=1" (FP)
  
  FNC <- sum((obs == TRUE) & (pred.p < pcut), na.rm = TRUE)  
  FPC <- sum((obs == FALSE) & (pred.p >= pcut), na.rm = TRUE) 
  
  MR <- sum(weight_FN * FNC + weight_FP * FPC) / length(obs)
  
  if (is.na(MR) || !is.finite(MR)) {
    MR <- 1 
   
    } 
  return(MR)
  }

# Define a sequence of cutoff values from 0.01 to 1, by 0.01
pcut.seq <- seq(0.01, 1, by = 0.01)

MR_vec <- rep(0, length(pcut.seq))

for(i in 1:length(pcut.seq)) { 
  MR_vec[i] <- costfunc(obs = credit_train$class, pred.p = credit_train, pcut = pcut.seq[i])
}

cbind(pcut.seq, MR_vec)

##        pcut.seq MR_vec
##   [1,]     0.01      1
##   [2,]     0.02      1
##   [3,]     0.03      1
##   [4,]     0.04      1
##   [5,]     0.05      1
##   [6,]     0.06      1
##   [7,]     0.07      1
##   [8,]     0.08      1
##   [9,]     0.09      1
##  [10,]     0.10      1
##  [11,]     0.11      1
##  [12,]     0.12      1
##  [13,]     0.13      1
##  [14,]     0.14      1
##  [15,]     0.15      1
##  [16,]     0.16      1
##  [17,]     0.17      1
##  [18,]     0.18      1
##  [19,]     0.19      1
##  [20,]     0.20      1
##  [21,]     0.21      1
##  [22,]     0.22      1
##  [23,]     0.23      1
##  [24,]     0.24      1
##  [25,]     0.25      1
##  [26,]     0.26      1
##  [27,]     0.27      1
##  [28,]     0.28      1
##  [29,]     0.29      1
##  [30,]     0.30      1
##  [31,]     0.31      1
##  [32,]     0.32      1
##  [33,]     0.33      1
##  [34,]     0.34      1
##  [35,]     0.35      1
##  [36,]     0.36      1
##  [37,]     0.37      1
##  [38,]     0.38      1
##  [39,]     0.39      1
##  [40,]     0.40      1
##  [41,]     0.41      1
##  [42,]     0.42      1
##  [43,]     0.43      1
##  [44,]     0.44      1
##  [45,]     0.45      1
##  [46,]     0.46      1
##  [47,]     0.47      1
##  [48,]     0.48      1
##  [49,]     0.49      1
##  [50,]     0.50      1
##  [51,]     0.51      1
##  [52,]     0.52      1
##  [53,]     0.53      1
##  [54,]     0.54      1
##  [55,]     0.55      1
##  [56,]     0.56      1
##  [57,]     0.57      1
##  [58,]     0.58      1
##  [59,]     0.59      1
##  [60,]     0.60      1
##  [61,]     0.61      1
##  [62,]     0.62      1
##  [63,]     0.63      1
##  [64,]     0.64      1
##  [65,]     0.65      1
##  [66,]     0.66      1
##  [67,]     0.67      1
##  [68,]     0.68      1
##  [69,]     0.69      1
##  [70,]     0.70      1
##  [71,]     0.71      1
##  [72,]     0.72      1
##  [73,]     0.73      1
##  [74,]     0.74      1
##  [75,]     0.75      1
##  [76,]     0.76      1
##  [77,]     0.77      1
##  [78,]     0.78      1
##  [79,]     0.79      1
##  [80,]     0.80      1
##  [81,]     0.81      1
##  [82,]     0.82      1
##  [83,]     0.83      1
##  [84,]     0.84      1
##  [85,]     0.85      1
##  [86,]     0.86      1
##  [87,]     0.87      1
##  [88,]     0.88      1
##  [89,]     0.89      1
##  [90,]     0.90      1
##  [91,]     0.91      1
##  [92,]     0.92      1
##  [93,]     0.93      1
##  [94,]     0.94      1
##  [95,]     0.95      1
##  [96,]     0.96      1
##  [97,]     0.97      1
##  [98,]     0.98      1
##  [99,]     0.99      1
## [100,]     1.00      1

plot(pcut.seq, MR_vec, type = "b", main = "Misclassification Rate by Cutoff", 
     xlab = "Cutoff (pcut)", ylab = "Misclassification Rate (MR)")

# Find the optimal cutoff with the minimum misclassification rate
optimal.pcut <- pcut.seq[which.min(MR_vec)]
print(optimal.pcut)

## [1] 0.01

Your observation: i got a 0.1 cutoff which means we only need 10% prob for an observation to be positive

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

pred_prob_credit_train <- predict(glm_credit, newdata = credit_train, type = "response")

class.glm_opt.train <- ifelse(pred_prob_credit_train >= 0.1, TRUE, FALSE)
table(credit_train$Class, class.glm_opt.train, dnn = c("True", "Predicted"))

##        Predicted
## True    FALSE TRUE
##   FALSE    16  213
##   TRUE      1  570

confusion_matrix_train <- table(credit_train$Class, class.glm_opt.train, dnn = c("True", "Predicted"))
print(confusion_matrix_train)

##        Predicted
## True    FALSE TRUE
##   FALSE    16  213
##   TRUE      1  570

MR_train <- sum(confusion_matrix_train[1, 2] + confusion_matrix_train[2, 1]) / sum(confusion_matrix_train)
print(paste("Misclassification Rate (MR) for Training Set:", MR_train))

## [1] "Misclassification Rate (MR) for Training Set: 0.2675"

Your observation: an mr of 0.2675 means that 26.75% of the predictions were wrong. so 73.25% accuracy thats not bad but it could be higher

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

library(ROCR)

pred_train <- prediction(pred_prob_credit_train, credit_train$Class)  

ROC <- performance(pred_train, "tpr", "fpr")
plot(ROC, colorize = TRUE, main = "ROC")

auc_train <- unlist(slot(performance(pred_train, "auc"), "y.values"))
print(paste("AUC", auc_train))

## [1] "AUC 0.850480655251264"

Your observation: with an AUC of 0.8504 that means that the model predicts the risk classes 85% correctly which is effective. the roc was around 73% and AUC was 85% so i think the model could slightly improve

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

pred_prob_credit_test <- predict(glm_credit, newdata = credit_test, type = "response")

pred_class_credit_test_optimal <- ifelse(pred_prob_credit_test > 0.1, TRUE, FALSE)

conf_test <- table(credit_test$Class, pred_class_credit_test_optimal, dnn = c("True", "Predicted"))
print(conf_test)

##        Predicted
## True    FALSE TRUE
##   FALSE     6   65
##   TRUE      3  126

MR_test <- 1 - sum(diag(conf_test)) / sum(conf_test)
print(paste("Misclassification Rate (MR) for Test Set:", MR_test))

## [1] "Misclassification Rate (MR) for Test Set: 0.34"

Your observation: MR meas a 34% incorrect rate which is a increase from the training set it seems that there may be some over fitting in the model

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

pred_test <- prediction(pred_prob_credit_test, credit_test$Class)

ROC_test <- performance(pred_test, "tpr", "fpr")
plot(ROC_test, colorize = TRUE, main = "ROC test")

auc_test <- unlist(slot(performance(pred_test, "auc"), "y.values"))
print(paste("AUC test", auc_test))

## [1] "AUC test 0.735342286275794"

Your observation: the AUC is 73.53% which is still acceptable but is lower that the training which means that the model could be improved.

Task 5: Using different weights (20pts)

Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!

1. Obtain optimal probability cut-off point again, with the new weights.

# Redefine the cost function with updated weights (more weight on FN)
costfunc <- function(obs, pred.p, pcut) {
    weight_FN = 5  # Higher weight for "true=1 but pred=0" (FN)
    weight_FP = 1  # Lower weight for "true=0 but pred=1" (FP)
    
    # Count FN and FP based on the cutoff
    FNC <- sum((obs == 1) & (pred.p < pcut))  # FN count
    FPC <- sum((obs == 0) & (pred.p >= pcut))  # FP count
    
    # Weighted misclassification rate
    MR <- sum(weight_FN * FNC + weight_FP * FPC) / length(obs)
    return(MR)
}

# Define a sequence of cutoff values from 0.01 to 1, in steps of 0.01
pcut.seq <- seq(0.01, 1, by = 0.01)

# Calculate the weighted misclassification rate (MR) for each cutoff
MR_vec <- sapply(pcut.seq, function(cutoff) {
    costfunc(obs = credit_train$Class, pred.p = pred_prob_credit_train, pcut = cutoff)
})

# Find the optimal cutoff with the minimum weighted misclassification rate
optimal_cutoff <- pcut.seq[which.min(MR_vec)]
print(paste("Optimal Cutoff with Weights:", optimal_cutoff))

## [1] "Optimal Cutoff with Weights: 0.22"

Your observation: with a weight change the model now puts more value on making a false negative as its more costly this better prevent false negatives but can lead to more false positives

2. Obtain the confusion matrix and MR for the training set.

pred_class_credit_train_optimal <- ifelse(pred_prob_credit_train > 0.22, TRUE, FALSE)
conf_train <- table(credit_train$Class, pred_class_credit_train_optimal, dnn = c("True", "Predicted"))
print(conf_train)

##        Predicted
## True    FALSE TRUE
##   FALSE    41  188
##   TRUE      4  567

MR_train <- costfunc(obs = credit_train$Class, pred.p = pred_prob_credit_train, pcut = 0.22)
print(paste("Weighted MR for Training Set:", MR_train))

## [1] "Weighted MR for Training Set: 0.26"

Your observation: the new MR 0.26 is almost the same but it slightly increased that means the model has slightly stricter conditions of a non credit risk

3. Obtain the confusion matrix and MR for the test set.

pred_prob_credit_test <- predict(glm_credit, newdata = credit_test, type = "response")
pred_class_credit_test_optimal <- ifelse(pred_prob_credit_test > 0.22, TRUE, FALSE)
conf_test <- table(credit_test$Class, pred_class_credit_test_optimal, dnn = c("True", "Predicted"))
print(conf_test)

##        Predicted
## True    FALSE TRUE
##   FALSE    17   54
##   TRUE      5  124

MR_test <- costfunc(obs = credit_test$Class, pred.p = pred_prob_credit_test, pcut = 0.22)
print(paste("Weighted MR for Test Set:", MR_test))

## [1] "Weighted MR for Test Set: 0.395"

Your observation: the new MR 0.395 increased from 0.34 which means its trying to prevent more false negatives but it leads to a higher MR

Task 6: Conlusion (10pts)

Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model.

In this analysis, we aimed to develop a model that balances accuracy with the cost of misclassifications, prioritizing false negatives (misclassifying a bad customer as good) as five times more costly than false positives. With equal weights on false positives and false negatives. the optimal probability cutoff was 0.1, with an MR of 0.2675 and a test MR of 0.34, with AUCs of 0.8504 training and 0.7353 test, indicating good discrimination ability on both.

With the weighted cost structure, the optimal cutoff shifted to 0.22, prioritizing caution in classifying customers as good. This resulted in a weighted MR of 0.26 for the training set and 0.395 for the test set, showing a trade-off where the model became more conservative but less generalizable. To improve the model we need keep it conservative but make it more accurate maybe by removing some of the less significant variables and more testing

HW4

Zion P

10/23/2024

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure. (10pts)

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

Task 5: Using different weights (20pts)

1. Obtain optimal probability cut-off point again, with the new weights.

2. Obtain the confusion matrix and MR for the training set.

3. Obtain the confusion matrix and MR for the test set.

Task 6: Conlusion (10pts)

HW4

Zion P

10/23/2024

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure. (10pts)

3. Split the dataset into training and test set. Please use the random seed as 2024 for reproducibility. (10pts)

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

Task 5: Using different weights (20pts)

1. Obtain optimal probability cut-off point again, with the new weights.

2. Obtain the confusion matrix and MR for the training set.

3. Obtain the confusion matrix and MR for the test set.

Task 6: Conlusion (10pts)

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)