Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: ggplot2

## Loading required package: lattice

data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: A majority of variables are numeric based, more specific, Binary. We could have done the same to them that we did for class, but since they are predictor values, this would be counter intuitive. We have a relatively large data set, with a lot of variables, a 1,000 to 49 count.

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.

2. Explore the dataset to understand its structure. (10pts)

colSums(is.na(GermanCredit)) #the code book says there are no missing, but double checking

##                           Duration                             Amount 
##                                  0                                  0 
##          InstallmentRatePercentage                  ResidenceDuration 
##                                  0                                  0 
##                                Age              NumberExistingCredits 
##                                  0                                  0 
##            NumberPeopleMaintenance                          Telephone 
##                                  0                                  0 
##                      ForeignWorker                              Class 
##                                  0                                  0 
##         CheckingAccountStatus.lt.0     CheckingAccountStatus.0.to.200 
##                                  0                                  0 
##       CheckingAccountStatus.gt.200     CreditHistory.NoCredit.AllPaid 
##                                  0                                  0 
##     CreditHistory.ThisBank.AllPaid             CreditHistory.PaidDuly 
##                                  0                                  0 
##                CreditHistory.Delay                     Purpose.NewCar 
##                                  0                                  0 
##                    Purpose.UsedCar        Purpose.Furniture.Equipment 
##                                  0                                  0 
##           Purpose.Radio.Television          Purpose.DomesticAppliance 
##                                  0                                  0 
##                    Purpose.Repairs                  Purpose.Education 
##                                  0                                  0 
##                 Purpose.Retraining                   Purpose.Business 
##                                  0                                  0 
##         SavingsAccountBonds.lt.100     SavingsAccountBonds.100.to.500 
##                                  0                                  0 
##    SavingsAccountBonds.500.to.1000        SavingsAccountBonds.gt.1000 
##                                  0                                  0 
##            EmploymentDuration.lt.1          EmploymentDuration.1.to.4 
##                                  0                                  0 
##          EmploymentDuration.4.to.7            EmploymentDuration.gt.7 
##                                  0                                  0 
##   Personal.Male.Divorced.Seperated          Personal.Female.NotSingle 
##                                  0                                  0 
##               Personal.Male.Single        OtherDebtorsGuarantors.None 
##                                  0                                  0 
## OtherDebtorsGuarantors.CoApplicant                Property.RealEstate 
##                                  0                                  0 
##                 Property.Insurance                  Property.CarOther 
##                                  0                                  0 
##         OtherInstallmentPlans.Bank       OtherInstallmentPlans.Stores 
##                                  0                                  0 
##                       Housing.Rent                        Housing.Own 
##                                  0                                  0 
##            Job.UnemployedUnskilled              Job.UnskilledResident 
##                                  0                                  0 
##                Job.SkilledEmployee 
##                                  0

str(GermanCredit) #re-iterate the structure of the data set

## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...

summary(GermanCredit) #A quire the summ stats

##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker     Class         CheckingAccountStatus.lt.0
##  Min.   :0.000   Mode :logical   Min.   :0.000             
##  1st Qu.:1.000   FALSE:300       1st Qu.:0.000             
##  Median :1.000   TRUE :700       Median :0.000             
##  Mean   :0.963                   Mean   :0.274             
##  3rd Qu.:1.000                   3rd Qu.:1.000             
##  Max.   :1.000                   Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
##  Min.   :0.00                   Min.   :0.000                 
##  1st Qu.:0.00                   1st Qu.:0.000                 
##  Median :0.00                   Median :0.000                 
##  Mean   :0.04                   Mean   :0.049                 
##  3rd Qu.:0.00                   3rd Qu.:0.000                 
##  Max.   :1.00                   Max.   :1.000                 
##  CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.00           Min.   :0.000       Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00           1st Qu.:0.000       1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.00           Median :0.000       Median :0.000   Median :0.000  
##  Mean   :0.53           Mean   :0.088       Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.00           3rd Qu.:0.000       3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.00           Max.   :1.000       Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
##  Min.   :0.000   Min.   :0.00      Min.   :0.000      Min.   :0.000   
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0.000      1st Qu.:0.000   
##  Median :0.000   Median :0.00      Median :0.000      Median :0.000   
##  Mean   :0.022   Mean   :0.05      Mean   :0.009      Mean   :0.097   
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0.000      3rd Qu.:0.000   
##  Max.   :1.000   Max.   :1.00      Max.   :1.000      Max.   :1.000   
##  SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
##  Min.   :0.000              Min.   :0.000                 
##  1st Qu.:0.000              1st Qu.:0.000                 
##  Median :1.000              Median :0.000                 
##  Mean   :0.603              Mean   :0.103                 
##  3rd Qu.:1.000              3rd Qu.:0.000                 
##  Max.   :1.000              Max.   :1.000                 
##  SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
##  Min.   :0.000                   Min.   :0.000              
##  1st Qu.:0.000                   1st Qu.:0.000              
##  Median :0.000                   Median :0.000              
##  Mean   :0.063                   Mean   :0.048              
##  3rd Qu.:0.000                   3rd Qu.:0.000              
##  Max.   :1.000                   Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
##  Min.   :0.000           Min.   :0.00                    
##  1st Qu.:0.000           1st Qu.:0.00                    
##  Median :0.000           Median :0.00                    
##  Mean   :0.253           Mean   :0.05                    
##  3rd Qu.:1.000           3rd Qu.:0.00                    
##  Max.   :1.000           Max.   :1.00                    
##  Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
##  Min.   :0.00              Min.   :0.000        Min.   :0.000              
##  1st Qu.:0.00              1st Qu.:0.000        1st Qu.:1.000              
##  Median :0.00              Median :1.000        Median :1.000              
##  Mean   :0.31              Mean   :0.548        Mean   :0.907              
##  3rd Qu.:1.00              3rd Qu.:1.000        3rd Qu.:1.000              
##  Max.   :1.00              Max.   :1.000        Max.   :1.000              
##  OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
##  Min.   :0.000                      Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                      1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                      Median :0.000       Median :0.000     
##  Mean   :0.041                      Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                      3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                      Max.   :1.000       Max.   :1.000     
##  Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
##  Min.   :0.000     Min.   :0.000              Min.   :0.000               
##  1st Qu.:0.000     1st Qu.:0.000              1st Qu.:0.000               
##  Median :0.000     Median :0.000              Median :0.000               
##  Mean   :0.332     Mean   :0.139              Mean   :0.047               
##  3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:0.000               
##  Max.   :1.000     Max.   :1.000              Max.   :1.000               
##   Housing.Rent    Housing.Own    Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :0.000   Median :1.000   Median :0.000           Median :0.0          
##  Mean   :0.179   Mean   :0.713   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee
##  Min.   :0.00       
##  1st Qu.:0.00       
##  Median :1.00       
##  Mean   :0.63       
##  3rd Qu.:1.00       
##  Max.   :1.00

barplot(table(GermanCredit$Class), # our focus will be on `Class`, visual rep or proportion.
        ylab = "Frequency",
        xlab = "Class")

Your observation: When going through the code book, I saw that there were no missings, but I wanted to double check, and it was correct. Then I ran through the structure one more time, then I loaded the summary stats to find min’s, max’s and any variable that seems out of place. From what I can tell, there aren’t any points that fall out of the norm. I then made a bar chart based on our Class variable (due to our us naming it as the response variable), we can a majority of observations fall on True.

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80) #using a 80 20 split
German_train = GermanCredit[index,]
German_test = GermanCredit[-index,]

Your observation: Used an 80-20 percent split

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

glm_german<- glm(Class ~. , family=binomial, data=German_train)

Your observation: I used the glm function to create a logistic equation, using all the variables in the training data set.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

summary(glm_german)

## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = German_train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         9.241e+00  1.719e+00   5.376 7.61e-08 ***
## Duration                           -2.994e-02  1.072e-02  -2.794 0.005214 ** 
## Amount                             -1.771e-04  5.095e-05  -3.475 0.000510 ***
## InstallmentRatePercentage          -3.718e-01  1.036e-01  -3.589 0.000332 ***
## ResidenceDuration                   2.577e-02  1.010e-01   0.255 0.798510    
## Age                                 1.183e-02  1.097e-02   1.078 0.280974    
## NumberExistingCredits              -1.225e-01  2.189e-01  -0.560 0.575690    
## NumberPeopleMaintenance            -1.731e-01  2.945e-01  -0.588 0.556678    
## Telephone                          -4.236e-01  2.371e-01  -1.786 0.074081 .  
## ForeignWorker                      -1.651e+00  7.421e-01  -2.224 0.026143 *  
## CheckingAccountStatus.lt.0         -1.817e+00  2.710e-01  -6.703 2.04e-11 ***
## CheckingAccountStatus.0.to.200     -1.432e+00  2.686e-01  -5.330 9.81e-08 ***
## CheckingAccountStatus.gt.200       -5.912e-01  4.631e-01  -1.277 0.201696    
## CreditHistory.NoCredit.AllPaid     -8.724e-01  5.139e-01  -1.698 0.089584 .  
## CreditHistory.ThisBank.AllPaid     -1.676e+00  5.493e-01  -3.052 0.002277 ** 
## CreditHistory.PaidDuly             -6.686e-01  2.939e-01  -2.275 0.022899 *  
## CreditHistory.Delay                -9.413e-01  3.780e-01  -2.491 0.012756 *  
## Purpose.NewCar                     -1.733e+00  1.013e+00  -1.710 0.087282 .  
## Purpose.UsedCar                     6.716e-02  1.033e+00   0.065 0.948146    
## Purpose.Furniture.Equipment        -8.257e-01  1.015e+00  -0.814 0.415816    
## Purpose.Radio.Television           -8.386e-01  1.019e+00  -0.823 0.410457    
## Purpose.DomesticAppliance          -1.227e+00  1.328e+00  -0.923 0.355762    
## Purpose.Repairs                    -1.321e+00  1.165e+00  -1.134 0.256825    
## Purpose.Education                  -2.020e+00  1.088e+00  -1.857 0.063374 .  
## Purpose.Retraining                  4.276e-01  1.640e+00   0.261 0.794237    
## Purpose.Business                   -8.618e-01  1.032e+00  -0.835 0.403529    
## SavingsAccountBonds.lt.100         -1.266e+00  3.201e-01  -3.956 7.63e-05 ***
## SavingsAccountBonds.100.to.500     -1.075e+00  4.171e-01  -2.577 0.009964 ** 
## SavingsAccountBonds.500.to.1000    -8.768e-01  5.216e-01  -1.681 0.092761 .  
## SavingsAccountBonds.gt.1000         1.301e-02  6.161e-01   0.021 0.983157    
## EmploymentDuration.lt.1             3.581e-01  5.167e-01   0.693 0.488195    
## EmploymentDuration.1.to.4           5.527e-01  5.000e-01   1.105 0.268967    
## EmploymentDuration.4.to.7           9.863e-01  5.355e-01   1.842 0.065524 .  
## EmploymentDuration.gt.7             5.253e-01  5.039e-01   1.042 0.297218    
## Personal.Male.Divorced.Seperated   -2.546e-01  5.214e-01  -0.488 0.625274    
## Personal.Female.NotSingle          -1.274e-01  3.573e-01  -0.357 0.721452    
## Personal.Male.Single                4.118e-01  3.623e-01   1.137 0.255622    
## OtherDebtorsGuarantors.None        -1.239e+00  5.370e-01  -2.308 0.021018 *  
## OtherDebtorsGuarantors.CoApplicant -1.565e+00  6.828e-01  -2.292 0.021919 *  
## Property.RealEstate                 7.166e-01  4.898e-01   1.463 0.143477    
## Property.Insurance                  3.544e-01  4.785e-01   0.741 0.458926    
## Property.CarOther                   6.110e-01  4.648e-01   1.314 0.188702    
## OtherInstallmentPlans.Bank         -8.504e-01  2.730e-01  -3.115 0.001838 ** 
## OtherInstallmentPlans.Stores       -4.293e-01  4.711e-01  -0.911 0.362139    
## Housing.Rent                       -9.538e-01  5.624e-01  -1.696 0.089924 .  
## Housing.Own                        -2.723e-01  5.282e-01  -0.516 0.606157    
## Job.UnemployedUnskilled             1.449e+00  8.788e-01   1.649 0.099175 .  
## Job.UnskilledResident              -2.641e-03  4.101e-01  -0.006 0.994861    
## Job.SkilledEmployee                -1.073e-02  3.349e-01  -0.032 0.974438    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 958.02  on 799  degrees of freedom
## Residual deviance: 672.78  on 751  degrees of freedom
## AIC: 770.78
## 
## Number of Fisher Scoring iterations: 5

Your observation: A multitude of Variables have a significant Impact on Class or classification on credit risk, but most variables have an inverse (negative) affect with the response variable. Such as Job.UnemployedUnskilled, as Job.UnemployedUnskilled goes up by 1, it lowers/subtracts from the overall amount/Class amount. Meaning if the individual is unemployed and has no skills, decreasing the intercept, meaning their credit will fall.

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

pred_prob_german_train <- predict(glm_german, type="response")
hist(pred_prob_german_train)

Your observation: We can see that this very left skewed, meaning that our cut off value will be on the higher side.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

# define MR (or a cost function) with input "obs" being observed response 
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
    weight_FN = 1    # define the weight for "true=1 but pred=0" (FN)
    weight_FP = 1    # define the weight for "true=0 but pred=1" (FP)
    FNC = sum( (obs==1) & (pred.p < pcut))   # count for "true=1 but pred=0"   (FN)
    FPC = sum( (obs==0) & (pred.p >=pcut))   # count for "true=0 but pred=1"   (FP)
    MR  = sum(weight_FN*FNC + weight_FP*FPC) / length(obs)  # misclassification with weight
    return(MR) # you have to return to a value when you write R functions
}

# define a sequence from 0.01 to 1 by 0.01
pcut.seq = seq(0.01, 1, 0.01)

# write a loop for all p-cut to see which one provides the smallest cost
# first, need to define a 0 vector in order to save the value of cost from all pcut
MR_vec = rep(0, length(pcut.seq))  
for(i in 1:length(pcut.seq)){ 
    MR_vec[i] = costfunc(obs = German_train$Class, pred.p = pred_prob_german_train, pcut = pcut.seq[i])  
} # end of the loop

cbind(pcut.seq,MR_vec)

##        pcut.seq  MR_vec
##   [1,]     0.01 0.28625
##   [2,]     0.02 0.28625
##   [3,]     0.03 0.28625
##   [4,]     0.04 0.27750
##   [5,]     0.05 0.27750
##   [6,]     0.06 0.27750
##   [7,]     0.07 0.27625
##   [8,]     0.08 0.27375
##   [9,]     0.09 0.27250
##  [10,]     0.10 0.26750
##  [11,]     0.11 0.26375
##  [12,]     0.12 0.26375
##  [13,]     0.13 0.26375
##  [14,]     0.14 0.26250
##  [15,]     0.15 0.26250
##  [16,]     0.16 0.26250
##  [17,]     0.17 0.26125
##  [18,]     0.18 0.25500
##  [19,]     0.19 0.25250
##  [20,]     0.20 0.25000
##  [21,]     0.21 0.24500
##  [22,]     0.22 0.24000
##  [23,]     0.23 0.24125
##  [24,]     0.24 0.23875
##  [25,]     0.25 0.23250
##  [26,]     0.26 0.22750
##  [27,]     0.27 0.22625
##  [28,]     0.28 0.22625
##  [29,]     0.29 0.22500
##  [30,]     0.30 0.22250
##  [31,]     0.31 0.22250
##  [32,]     0.32 0.21750
##  [33,]     0.33 0.21625
##  [34,]     0.34 0.21000
##  [35,]     0.35 0.20625
##  [36,]     0.36 0.20500
##  [37,]     0.37 0.20500
##  [38,]     0.38 0.20500
##  [39,]     0.39 0.19625
##  [40,]     0.40 0.19625
##  [41,]     0.41 0.20000
##  [42,]     0.42 0.20000
##  [43,]     0.43 0.20000
##  [44,]     0.44 0.20250
##  [45,]     0.45 0.20125
##  [46,]     0.46 0.20250
##  [47,]     0.47 0.20250
##  [48,]     0.48 0.19750
##  [49,]     0.49 0.19750
##  [50,]     0.50 0.19625
##  [51,]     0.51 0.19875
##  [52,]     0.52 0.20000
##  [53,]     0.53 0.20000
##  [54,]     0.54 0.19875
##  [55,]     0.55 0.19875
##  [56,]     0.56 0.20000
##  [57,]     0.57 0.20750
##  [58,]     0.58 0.21250
##  [59,]     0.59 0.21500
##  [60,]     0.60 0.21750
##  [61,]     0.61 0.22375
##  [62,]     0.62 0.22250
##  [63,]     0.63 0.22625
##  [64,]     0.64 0.23000
##  [65,]     0.65 0.23125
##  [66,]     0.66 0.23500
##  [67,]     0.67 0.24000
##  [68,]     0.68 0.24250
##  [69,]     0.69 0.24250
##  [70,]     0.70 0.24625
##  [71,]     0.71 0.24625
##  [72,]     0.72 0.24750
##  [73,]     0.73 0.24875
##  [74,]     0.74 0.24875
##  [75,]     0.75 0.26000
##  [76,]     0.76 0.25875
##  [77,]     0.77 0.26125
##  [78,]     0.78 0.26875
##  [79,]     0.79 0.27125
##  [80,]     0.80 0.27250
##  [81,]     0.81 0.28250
##  [82,]     0.82 0.29000
##  [83,]     0.83 0.30000
##  [84,]     0.84 0.31000
##  [85,]     0.85 0.31625
##  [86,]     0.86 0.32875
##  [87,]     0.87 0.34000
##  [88,]     0.88 0.35500
##  [89,]     0.89 0.37000
##  [90,]     0.90 0.38875
##  [91,]     0.91 0.41750
##  [92,]     0.92 0.43625
##  [93,]     0.93 0.45000
##  [94,]     0.94 0.48625
##  [95,]     0.95 0.52250
##  [96,]     0.96 0.54625
##  [97,]     0.97 0.58750
##  [98,]     0.98 0.62875
##  [99,]     0.99 0.68250
## [100,]     1.00 0.71375

# draw a plot with X axis being all pcut and Y axis being associated cost
plot(pcut.seq, MR_vec)

# find the optimal pcut
optimal.pcut = pcut.seq[which(MR_vec==min(MR_vec))]
print(optimal.pcut)

## [1] 0.39 0.40 0.50

Your observation: We have 3 possible cut off values, depending on how confident we are with the model, with a smaller value we predict heavier on True, but might end up with more False Positives. Or with a higher value we predict heavier on False values, but we might end up with more False Negatives, which may be worse.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

# step 1. get binary classification
pred_class_german_train_optimal <- (pred_prob_german_train>0.39)*1
# step 2. get confusion matrix, MR, FPR, FNR
conf_train <- table(German_train$Class, pred_class_german_train_optimal, dnn = c("True", "Predicted"))
conf_train
MR<- 1 - sum(diag(conf_train)) / sum(conf_train)

weighted_MR <- costfunc(obs = German_train$Class, pred.p = pred_prob_german_train, pcut = 1)

FPR<- conf_train[1, 2] / (conf_train[1, 2] + conf_train[1, 1])

FNR<- conf_train[2, 1] / (conf_train[2, 1] + conf_train[2, 2])
print(paste0("MR:",MR))
print(paste0("Weighted_MR:",weighted_MR))
print(paste0("FPR:",FPR))
print(paste0("FNR:",FNR))

Your observation: After looking at each cut point, I observed that .39 is a better choice, because it still has higher values of true positives and negatives, but keeps the arguably worse outcome (False Negatives) at the lowest it is allowed to be.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

install.packages('ROCR')
# don't want to install the package on others who run this notebook, so in eval = FALSE they don't actually install the package on their computer

library(ROCR)
pred_train <- prediction(pred_prob_german_train, German_train$Class)
pred_train

## A prediction instance
##   with 800 data points

ROC <- performance(pred_train, "tpr", "fpr")
ROC@x.values

## [[1]]
##   [1] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   [7] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##  [13] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##  [19] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##  [25] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##  [31] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##  [37] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##  [43] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##  [49] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##  [55] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##  [61] 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##  [67] 0.000000000 0.000000000 0.000000000 0.004366812 0.004366812 0.004366812
##  [73] 0.004366812 0.004366812 0.004366812 0.004366812 0.004366812 0.004366812
##  [79] 0.004366812 0.004366812 0.004366812 0.004366812 0.008733624 0.008733624
##  [85] 0.008733624 0.008733624 0.008733624 0.008733624 0.008733624 0.008733624
##  [91] 0.008733624 0.008733624 0.008733624 0.008733624 0.008733624 0.008733624
##  [97] 0.008733624 0.008733624 0.008733624 0.008733624 0.008733624 0.008733624
## [103] 0.008733624 0.008733624 0.008733624 0.008733624 0.008733624 0.008733624
## [109] 0.008733624 0.008733624 0.008733624 0.008733624 0.008733624 0.008733624
## [115] 0.008733624 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437
## [121] 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437
## [127] 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437
## [133] 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437
## [139] 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437
## [145] 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437 0.013100437
## [151] 0.013100437 0.017467249 0.017467249 0.017467249 0.017467249 0.017467249
## [157] 0.017467249 0.017467249 0.017467249 0.017467249 0.017467249 0.017467249
## [163] 0.017467249 0.017467249 0.017467249 0.017467249 0.017467249 0.017467249
## [169] 0.017467249 0.017467249 0.021834061 0.021834061 0.021834061 0.021834061
## [175] 0.021834061 0.021834061 0.021834061 0.021834061 0.021834061 0.021834061
## [181] 0.021834061 0.021834061 0.021834061 0.021834061 0.021834061 0.021834061
## [187] 0.021834061 0.021834061 0.021834061 0.021834061 0.021834061 0.021834061
## [193] 0.021834061 0.021834061 0.026200873 0.026200873 0.026200873 0.026200873
## [199] 0.026200873 0.026200873 0.026200873 0.026200873 0.026200873 0.026200873
## [205] 0.026200873 0.026200873 0.026200873 0.026200873 0.026200873 0.026200873
## [211] 0.026200873 0.026200873 0.026200873 0.026200873 0.026200873 0.026200873
## [217] 0.030567686 0.030567686 0.030567686 0.030567686 0.030567686 0.030567686
## [223] 0.030567686 0.030567686 0.030567686 0.030567686 0.034934498 0.034934498
## [229] 0.034934498 0.034934498 0.034934498 0.034934498 0.034934498 0.034934498
## [235] 0.034934498 0.034934498 0.034934498 0.034934498 0.034934498 0.034934498
## [241] 0.034934498 0.034934498 0.039301310 0.043668122 0.043668122 0.043668122
## [247] 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122
## [253] 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122
## [259] 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122
## [265] 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122
## [271] 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122
## [277] 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122
## [283] 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122
## [289] 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122 0.043668122
## [295] 0.043668122 0.043668122 0.043668122 0.048034934 0.048034934 0.048034934
## [301] 0.048034934 0.048034934 0.048034934 0.048034934 0.048034934 0.048034934
## [307] 0.048034934 0.048034934 0.048034934 0.052401747 0.052401747 0.052401747
## [313] 0.056768559 0.056768559 0.056768559 0.056768559 0.056768559 0.056768559
## [319] 0.056768559 0.056768559 0.056768559 0.056768559 0.056768559 0.061135371
## [325] 0.061135371 0.061135371 0.061135371 0.061135371 0.061135371 0.061135371
## [331] 0.061135371 0.061135371 0.061135371 0.061135371 0.065502183 0.065502183
## [337] 0.065502183 0.069868996 0.069868996 0.069868996 0.069868996 0.074235808
## [343] 0.074235808 0.074235808 0.074235808 0.074235808 0.074235808 0.074235808
## [349] 0.074235808 0.078602620 0.078602620 0.078602620 0.078602620 0.078602620
## [355] 0.078602620 0.078602620 0.078602620 0.078602620 0.078602620 0.082969432
## [361] 0.082969432 0.082969432 0.087336245 0.087336245 0.087336245 0.091703057
## [367] 0.096069869 0.096069869 0.096069869 0.096069869 0.096069869 0.096069869
## [373] 0.096069869 0.096069869 0.096069869 0.096069869 0.096069869 0.100436681
## [379] 0.100436681 0.100436681 0.100436681 0.100436681 0.100436681 0.100436681
## [385] 0.100436681 0.100436681 0.104803493 0.104803493 0.104803493 0.109170306
## [391] 0.109170306 0.109170306 0.109170306 0.109170306 0.113537118 0.113537118
## [397] 0.113537118 0.113537118 0.113537118 0.113537118 0.113537118 0.113537118
## [403] 0.113537118 0.113537118 0.113537118 0.113537118 0.117903930 0.117903930
## [409] 0.117903930 0.117903930 0.122270742 0.126637555 0.126637555 0.126637555
## [415] 0.131004367 0.135371179 0.135371179 0.135371179 0.135371179 0.135371179
## [421] 0.139737991 0.139737991 0.139737991 0.139737991 0.139737991 0.139737991
## [427] 0.139737991 0.139737991 0.144104803 0.148471616 0.152838428 0.152838428
## [433] 0.152838428 0.157205240 0.157205240 0.157205240 0.157205240 0.161572052
## [439] 0.161572052 0.165938865 0.165938865 0.165938865 0.170305677 0.174672489
## [445] 0.179039301 0.179039301 0.183406114 0.183406114 0.183406114 0.183406114
## [451] 0.183406114 0.183406114 0.187772926 0.187772926 0.187772926 0.187772926
## [457] 0.187772926 0.187772926 0.187772926 0.187772926 0.192139738 0.192139738
## [463] 0.192139738 0.196506550 0.196506550 0.196506550 0.200873362 0.205240175
## [469] 0.209606987 0.209606987 0.213973799 0.218340611 0.222707424 0.222707424
## [475] 0.222707424 0.222707424 0.227074236 0.227074236 0.231441048 0.231441048
## [481] 0.231441048 0.231441048 0.235807860 0.240174672 0.240174672 0.240174672
## [487] 0.244541485 0.248908297 0.248908297 0.248908297 0.253275109 0.257641921
## [493] 0.257641921 0.257641921 0.257641921 0.257641921 0.257641921 0.262008734
## [499] 0.266375546 0.266375546 0.266375546 0.266375546 0.270742358 0.275109170
## [505] 0.275109170 0.275109170 0.279475983 0.279475983 0.279475983 0.279475983
## [511] 0.279475983 0.279475983 0.279475983 0.279475983 0.283842795 0.288209607
## [517] 0.288209607 0.288209607 0.292576419 0.296943231 0.301310044 0.301310044
## [523] 0.301310044 0.305676856 0.305676856 0.305676856 0.305676856 0.305676856
## [529] 0.310043668 0.310043668 0.310043668 0.310043668 0.314410480 0.318777293
## [535] 0.318777293 0.318777293 0.318777293 0.318777293 0.323144105 0.323144105
## [541] 0.323144105 0.323144105 0.327510917 0.327510917 0.331877729 0.331877729
## [547] 0.336244541 0.340611354 0.340611354 0.344978166 0.344978166 0.349344978
## [553] 0.349344978 0.353711790 0.353711790 0.353711790 0.353711790 0.353711790
## [559] 0.353711790 0.353711790 0.353711790 0.358078603 0.358078603 0.358078603
## [565] 0.358078603 0.362445415 0.366812227 0.366812227 0.371179039 0.371179039
## [571] 0.375545852 0.375545852 0.375545852 0.375545852 0.375545852 0.375545852
## [577] 0.375545852 0.375545852 0.379912664 0.379912664 0.379912664 0.379912664
## [583] 0.379912664 0.379912664 0.379912664 0.379912664 0.384279476 0.384279476
## [589] 0.384279476 0.388646288 0.393013100 0.393013100 0.397379913 0.397379913
## [595] 0.397379913 0.401746725 0.401746725 0.401746725 0.406113537 0.406113537
## [601] 0.410480349 0.414847162 0.419213974 0.419213974 0.419213974 0.419213974
## [607] 0.419213974 0.423580786 0.423580786 0.427947598 0.432314410 0.436681223
## [613] 0.441048035 0.445414847 0.445414847 0.445414847 0.445414847 0.445414847
## [619] 0.445414847 0.449781659 0.454148472 0.454148472 0.458515284 0.458515284
## [625] 0.462882096 0.467248908 0.467248908 0.467248908 0.471615721 0.471615721
## [631] 0.475982533 0.480349345 0.484716157 0.489082969 0.493449782 0.493449782
## [637] 0.493449782 0.497816594 0.497816594 0.502183406 0.502183406 0.506550218
## [643] 0.506550218 0.506550218 0.510917031 0.515283843 0.515283843 0.515283843
## [649] 0.519650655 0.519650655 0.519650655 0.524017467 0.524017467 0.528384279
## [655] 0.532751092 0.532751092 0.532751092 0.532751092 0.532751092 0.532751092
## [661] 0.537117904 0.537117904 0.541484716 0.545851528 0.550218341 0.554585153
## [667] 0.558951965 0.563318777 0.567685590 0.572052402 0.576419214 0.580786026
## [673] 0.580786026 0.580786026 0.580786026 0.580786026 0.585152838 0.589519651
## [679] 0.589519651 0.589519651 0.593886463 0.598253275 0.602620087 0.602620087
## [685] 0.606986900 0.606986900 0.606986900 0.611353712 0.615720524 0.615720524
## [691] 0.620087336 0.620087336 0.624454148 0.628820961 0.633187773 0.637554585
## [697] 0.641921397 0.646288210 0.650655022 0.650655022 0.650655022 0.650655022
## [703] 0.655021834 0.659388646 0.663755459 0.668122271 0.672489083 0.672489083
## [709] 0.676855895 0.681222707 0.685589520 0.685589520 0.689956332 0.694323144
## [715] 0.698689956 0.698689956 0.703056769 0.707423581 0.711790393 0.716157205
## [721] 0.716157205 0.716157205 0.720524017 0.724890830 0.729257642 0.729257642
## [727] 0.729257642 0.733624454 0.737991266 0.737991266 0.737991266 0.742358079
## [733] 0.746724891 0.746724891 0.751091703 0.755458515 0.759825328 0.759825328
## [739] 0.764192140 0.768558952 0.772925764 0.777292576 0.781659389 0.786026201
## [745] 0.790393013 0.794759825 0.799126638 0.799126638 0.803493450 0.807860262
## [751] 0.812227074 0.816593886 0.816593886 0.816593886 0.816593886 0.820960699
## [757] 0.825327511 0.829694323 0.834061135 0.838427948 0.842794760 0.847161572
## [763] 0.851528384 0.855895197 0.860262009 0.860262009 0.864628821 0.868995633
## [769] 0.873362445 0.877729258 0.882096070 0.886462882 0.890829694 0.895196507
## [775] 0.899563319 0.903930131 0.903930131 0.908296943 0.912663755 0.912663755
## [781] 0.917030568 0.921397380 0.925764192 0.930131004 0.934497817 0.938864629
## [787] 0.943231441 0.947598253 0.951965066 0.951965066 0.956331878 0.960698690
## [793] 0.965065502 0.969432314 0.973799127 0.978165939 0.982532751 0.986899563
## [799] 0.991266376 0.995633188 1.000000000

plot(ROC, colorize=TRUE)

#Get the AUC
auc_train = unlist(slot(performance(pred_train, "auc"), "y.values")) #unlist makes it into a number
auc_train

## [1] 0.8504807

Your observation: Based on the AUC level this model, at the 0.39 cut off level, predicts good of the model with it leaning closer on the Excellent side.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

pred_prob_german_test<- predict(glm_german, newdata = German_test, type="response") # if type isn't there we would get a linear (x * beta)

# step 1. get binary classification with the new optimal p cut value
pred_class_german_test_optimal <- (pred_prob_german_test>0.39)*1
# step 2. get confusion matrix, MR, FPR, FNR
conf_test <- table(German_test$Class, pred_class_german_test_optimal, dnn = c("True", "Predicted"))
conf_test

MR<- 1 - sum(diag(conf_test)) / sum(conf_test)

weighted_MR <- costfunc(obs = German_test$Class, pred.p = pred_prob_german_test, pcut = 1)

FPR<- conf_test[1, 2] / (conf_test[1, 2] + conf_test[1, 1])

FNR<- conf_test[2, 1] / (conf_test[2, 1] + conf_test[2, 2])
print(paste0("Testing MR:",MR))
print(paste0("Testing Weighted_MR:",weighted_MR))
print(paste0("Testing FPR:",FPR))
print(paste0("Testing FNR:",FNR))

Your observation: Using the agreed upon 0.39 cut off value, and now using the testing data set, we can see that we have higher True positives than True Negatives, meaning we can correctly classify Whether someone is at good or bad credit risk. While keeping the False Negatives in the lower end.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

pred_test <- prediction(pred_prob_german_test, German_test$Class)
ROC_test <- performance(pred_test, "tpr", "fpr")
plot(ROC_test, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred_test, "auc"), "y.values"))

## [1] 0.7353423

Your observation: Using the testing data, we can only predict the model/values fairly, which is worse than the training data.

Task 5: Using different weights (20pts)

Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!

1. Obtain optimal probability cut-off point again, with the new weights.

# what this code chunck does

#redefine the costfunc to reflect different weight, more weigt on FN now
costfunc = function(obs, pred.p, pcut){
    weight_FN = 1    # define the weight for "true=1 but pred=0" (FN)
    weight_FP = 5    # define the weight for "true=0 but pred=1" (FP)
    FNC = sum( (obs==1) & (pred.p < pcut))   # count for "true=1 but pred=0"   (FN)
    FPC = sum( (obs==0) & (pred.p >=pcut))   # count for "true=0 but pred=1"   (FP)
    MR  = sum(weight_FN*FNC + weight_FP*FPC) / length(obs)  # misclassification with weight
    return(MR) # you have to return to a value when you write R functions
} 
# define a sequence from 0.01 to 1 by 0.01
pcut.seq = seq(0.01, 1, 0.01) 
# obtain MR_vec
MR_vec = rep(0, length(pcut.seq))  
for(i in 1:length(pcut.seq)){ 
    MR_vec[i] = costfunc(obs = German_train$Class, pred.p = pred_prob_german_train, pcut = pcut.seq[i])  
}
# draw a plot with X axis being all pcut and Y axis being associated cost
plot(pcut.seq, MR_vec)

# find the optimal pcut
optimal.pcut = pcut.seq[which(MR_vec==min(MR_vec))]
print(optimal.pcut)

## [1] 0.8

Your observation: Classifying someone as good when they are actually bad is a a False Positive, to which in this scenario is worse than classifying someone as bad when they are actually true. We now have a new cut off value based on the new weight distribution, which is now a cut off value of 0.8.

2. Obtain the confusion matrix and MR for the training set.

# step 1. get binary classification
pred_class_german_train_optimal <- (pred_prob_german_train>0.8)*1
# step 2. get confusion matrix, MR, FPR, FNR
conf_train <- table(German_train$Class, pred_class_german_train_optimal, dnn = c("True", "Predicted"))
conf_train
MR<- 1 - sum(diag(conf_train)) / sum(conf_train)

weighted_MR <- costfunc(obs = German_train$Class, pred.p = pred_prob_german_train, pcut = 1)

FPR<- conf_train[1, 2] / (conf_train[1, 2] + conf_train[1, 1])

FNR<- conf_train[2, 1] / (conf_train[2, 1] + conf_train[2, 2])
print(paste0("MR:",MR))
print(paste0("Weighted_MR:",weighted_MR))
print(paste0("FPR:",FPR))
print(paste0("FNR:",FNR))

Your observation: With heavier emphasis on False Positive, we see a considerable decrease in the number of predicted true but are actually false, but now all other amounts have drastically increased, with False Negative rate raising.

3. Obtain the confusion matrix and MR for the test set.

# step 1. get binary classification with the new optimal p cut value
pred_class_german_test_optimal <- (pred_prob_german_test>0.8)*1
# step 2. get confusion matrix, MR, FPR, FNR
conf_test <- table(German_test$Class, pred_class_german_test_optimal, dnn = c("True", "Predicted"))
conf_test

MR<- 1 - sum(diag(conf_test)) / sum(conf_test)

weighted_MR <- costfunc(obs = German_test$Class, pred.p = pred_prob_german_test, pcut = 1)

FPR<- conf_test[1, 2] / (conf_test[1, 2] + conf_test[1, 1])

FNR<- conf_test[2, 1] / (conf_test[2, 1] + conf_test[2, 2])
print(paste0("Testing MR:",MR))
print(paste0("Testing Weighted_MR:",weighted_MR))
print(paste0("Testing FPR:",FPR))
print(paste0("Testing FNR:",FNR))

Your observation: The same conclusion as with the training set, there are less False Positives, but the False Negatives and its rate is larger.

Task 6: Conlusion (10pts)

Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model.

We began by classifying coefficients for the predictor variable of Class, to which is a binary variable of True and False meaning whether an individual is or is not a risk for credit failure. We saw multiple coefficients that had a significant impact upon Class, with Job.UnemployedUnskilled being one, having a negative impact on good credit. Meaning that if you are unemployed and unskilled, the likelihood of having good credit is low.

We then move onto the Training data set, where we work on finding the probable cut off point for equal weights between FP and FN, with higher emphasis right now on FN, due to it’s higher impact. We were able to come up with a 0.39 cut off point, that creates the lowest value of False Negatives, which we then plug the data into an ROC plot, to see how well this model can predict. With a value of 85%, this model can predict the data pretty Good. Moving onto the Testing data, we repeat the process, while still using the 0.39 cut off point, we can see that the AUC value is lower than the training data, which we can only predict the model fairly at most.

We then change the weight of the False Positive variable, meaning we have decided to put our importance on FP than FN, to which lowers the False positive rate, while increasing the False Negative rate, which can be seen on both the training and Testing data sets.

Homework4

Chance Jones

10/23/2024

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure. (10pts)

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

Task 5: Using different weights (20pts)

1. Obtain optimal probability cut-off point again, with the new weights.

2. Obtain the confusion matrix and MR for the training set.

3. Obtain the confusion matrix and MR for the test set.

Task 6: Conlusion (10pts)

Homework4

Chance Jones

10/23/2024

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure. (10pts)

3. Split the dataset into training and test set. Please use the random seed as 2024 for reproducibility. (10pts)

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

Task 5: Using different weights (20pts)

1. Obtain optimal probability cut-off point again, with the new weights.

2. Obtain the confusion matrix and MR for the training set.

3. Obtain the confusion matrix and MR for the test set.

Task 6: Conlusion (10pts)

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)