Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. THe response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

install.packages('caret')

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: There are 62 columns and 1000 rows. In the dataset information page, it states that several variables are categorical (status of exisiting checking account, credit history, purpose, etc) so I assumed that we might have to re-categorize them as.factor but it appears that they have already been separated into their different levels so no re-categorization is necessary.

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Explore the dataset to understand its structure.

colnames(GermanCredit)
##  [1] "Duration"                           "Amount"                            
##  [3] "InstallmentRatePercentage"          "ResidenceDuration"                 
##  [5] "Age"                                "NumberExistingCredits"             
##  [7] "NumberPeopleMaintenance"            "Telephone"                         
##  [9] "ForeignWorker"                      "Class"                             
## [11] "CheckingAccountStatus.lt.0"         "CheckingAccountStatus.0.to.200"    
## [13] "CheckingAccountStatus.gt.200"       "CreditHistory.NoCredit.AllPaid"    
## [15] "CreditHistory.ThisBank.AllPaid"     "CreditHistory.PaidDuly"            
## [17] "CreditHistory.Delay"                "Purpose.NewCar"                    
## [19] "Purpose.UsedCar"                    "Purpose.Furniture.Equipment"       
## [21] "Purpose.Radio.Television"           "Purpose.DomesticAppliance"         
## [23] "Purpose.Repairs"                    "Purpose.Education"                 
## [25] "Purpose.Retraining"                 "Purpose.Business"                  
## [27] "SavingsAccountBonds.lt.100"         "SavingsAccountBonds.100.to.500"    
## [29] "SavingsAccountBonds.500.to.1000"    "SavingsAccountBonds.gt.1000"       
## [31] "EmploymentDuration.lt.1"            "EmploymentDuration.1.to.4"         
## [33] "EmploymentDuration.4.to.7"          "EmploymentDuration.gt.7"           
## [35] "Personal.Male.Divorced.Seperated"   "Personal.Female.NotSingle"         
## [37] "Personal.Male.Single"               "OtherDebtorsGuarantors.None"       
## [39] "OtherDebtorsGuarantors.CoApplicant" "Property.RealEstate"               
## [41] "Property.Insurance"                 "Property.CarOther"                 
## [43] "OtherInstallmentPlans.Bank"         "OtherInstallmentPlans.Stores"      
## [45] "Housing.Rent"                       "Housing.Own"                       
## [47] "Job.UnemployedUnskilled"            "Job.UnskilledResident"             
## [49] "Job.SkilledEmployee"
summary(GermanCredit)
##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker     Class         CheckingAccountStatus.lt.0
##  Min.   :0.000   Mode :logical   Min.   :0.000             
##  1st Qu.:1.000   FALSE:300       1st Qu.:0.000             
##  Median :1.000   TRUE :700       Median :0.000             
##  Mean   :0.963                   Mean   :0.274             
##  3rd Qu.:1.000                   3rd Qu.:1.000             
##  Max.   :1.000                   Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
##  Min.   :0.00                   Min.   :0.000                 
##  1st Qu.:0.00                   1st Qu.:0.000                 
##  Median :0.00                   Median :0.000                 
##  Mean   :0.04                   Mean   :0.049                 
##  3rd Qu.:0.00                   3rd Qu.:0.000                 
##  Max.   :1.00                   Max.   :1.000                 
##  CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.00           Min.   :0.000       Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00           1st Qu.:0.000       1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.00           Median :0.000       Median :0.000   Median :0.000  
##  Mean   :0.53           Mean   :0.088       Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.00           3rd Qu.:0.000       3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.00           Max.   :1.000       Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
##  Min.   :0.000   Min.   :0.00      Min.   :0.000      Min.   :0.000   
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0.000      1st Qu.:0.000   
##  Median :0.000   Median :0.00      Median :0.000      Median :0.000   
##  Mean   :0.022   Mean   :0.05      Mean   :0.009      Mean   :0.097   
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0.000      3rd Qu.:0.000   
##  Max.   :1.000   Max.   :1.00      Max.   :1.000      Max.   :1.000   
##  SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
##  Min.   :0.000              Min.   :0.000                 
##  1st Qu.:0.000              1st Qu.:0.000                 
##  Median :1.000              Median :0.000                 
##  Mean   :0.603              Mean   :0.103                 
##  3rd Qu.:1.000              3rd Qu.:0.000                 
##  Max.   :1.000              Max.   :1.000                 
##  SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
##  Min.   :0.000                   Min.   :0.000              
##  1st Qu.:0.000                   1st Qu.:0.000              
##  Median :0.000                   Median :0.000              
##  Mean   :0.063                   Mean   :0.048              
##  3rd Qu.:0.000                   3rd Qu.:0.000              
##  Max.   :1.000                   Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
##  Min.   :0.000           Min.   :0.00                    
##  1st Qu.:0.000           1st Qu.:0.00                    
##  Median :0.000           Median :0.00                    
##  Mean   :0.253           Mean   :0.05                    
##  3rd Qu.:1.000           3rd Qu.:0.00                    
##  Max.   :1.000           Max.   :1.00                    
##  Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
##  Min.   :0.00              Min.   :0.000        Min.   :0.000              
##  1st Qu.:0.00              1st Qu.:0.000        1st Qu.:1.000              
##  Median :0.00              Median :1.000        Median :1.000              
##  Mean   :0.31              Mean   :0.548        Mean   :0.907              
##  3rd Qu.:1.00              3rd Qu.:1.000        3rd Qu.:1.000              
##  Max.   :1.00              Max.   :1.000        Max.   :1.000              
##  OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
##  Min.   :0.000                      Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                      1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                      Median :0.000       Median :0.000     
##  Mean   :0.041                      Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                      3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                      Max.   :1.000       Max.   :1.000     
##  Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
##  Min.   :0.000     Min.   :0.000              Min.   :0.000               
##  1st Qu.:0.000     1st Qu.:0.000              1st Qu.:0.000               
##  Median :0.000     Median :0.000              Median :0.000               
##  Mean   :0.332     Mean   :0.139              Mean   :0.047               
##  3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:0.000               
##  Max.   :1.000     Max.   :1.000              Max.   :1.000               
##   Housing.Rent    Housing.Own    Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :0.000   Median :1.000   Median :0.000           Median :0.0          
##  Mean   :0.179   Mean   :0.713   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee
##  Min.   :0.00       
##  1st Qu.:0.00       
##  Median :1.00       
##  Mean   :0.63       
##  3rd Qu.:1.00       
##  Max.   :1.00

Your observation: This dataset has 49 columns total. 48 columns contain numeric data and one column class contains binary data that categorizes a customer as good or bad.

3. Split the dataset into training and test set. Please use the random seed as 2023 for reproducibility.

set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
credit.train = GermanCredit[index,]
credit.test = GermanCredit[-index,]

Your observation: This dataset was split into training data (80% of the observations, 800 rows) and testing data (20% of the observations, 200 rows). Each dataset still contains 49 columns.

Task 2: Model Fitting

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

credit.glm0 <- glm(Class~., family="binomial", data=credit.train)

Your observation: We fit a logistic regression model on the training dataset. Because we’re using a logistic regression instead of a linear regression, we add family='binomial to our code.

2. Summarize the model and interpret the coefficients.

summary(credit.glm0)
## 
## Call:
## glm(formula = Class ~ ., family = "binomial", data = credit.train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         7.948e+00  1.620e+00   4.908 9.22e-07 ***
## Duration                           -2.465e-02  1.027e-02  -2.401  0.01636 *  
## Amount                             -1.206e-04  4.943e-05  -2.440  0.01467 *  
## InstallmentRatePercentage          -2.766e-01  9.796e-02  -2.823  0.00476 ** 
## ResidenceDuration                   4.616e-02  9.831e-02   0.469  0.63872    
## Age                                 1.982e-02  1.046e-02   1.896  0.05802 .  
## NumberExistingCredits              -2.741e-01  2.145e-01  -1.278  0.20142    
## NumberPeopleMaintenance            -1.388e-01  2.898e-01  -0.479  0.63190    
## Telephone                          -2.586e-01  2.242e-01  -1.153  0.24877    
## ForeignWorker                      -1.789e+00  8.309e-01  -2.153  0.03132 *  
## CheckingAccountStatus.lt.0         -1.944e+00  2.646e-01  -7.347 2.02e-13 ***
## CheckingAccountStatus.0.to.200     -1.278e+00  2.551e-01  -5.009 5.46e-07 ***
## CheckingAccountStatus.gt.200       -5.367e-01  4.445e-01  -1.208  0.22724    
## CreditHistory.NoCredit.AllPaid     -1.284e+00  4.801e-01  -2.674  0.00750 ** 
## CreditHistory.ThisBank.AllPaid     -1.436e+00  4.997e-01  -2.873  0.00407 ** 
## CreditHistory.PaidDuly             -7.179e-01  2.865e-01  -2.506  0.01221 *  
## CreditHistory.Delay                -5.630e-01  3.726e-01  -1.511  0.13081    
## Purpose.NewCar                     -1.917e+00  8.668e-01  -2.212  0.02697 *  
## Purpose.UsedCar                    -2.727e-01  8.931e-01  -0.305  0.76006    
## Purpose.Furniture.Equipment        -1.069e+00  8.737e-01  -1.223  0.22118    
## Purpose.Radio.Television           -1.054e+00  8.812e-01  -1.196  0.23171    
## Purpose.DomesticAppliance          -1.109e+00  1.220e+00  -0.909  0.36321    
## Purpose.Repairs                    -1.992e+00  1.035e+00  -1.924  0.05433 .  
## Purpose.Education                  -1.896e+00  9.500e-01  -1.996  0.04595 *  
## Purpose.Retraining                 -1.045e+00  1.507e+00  -0.694  0.48796    
## Purpose.Business                   -1.240e+00  8.975e-01  -1.381  0.16721    
## SavingsAccountBonds.lt.100         -9.516e-01  2.975e-01  -3.199  0.00138 ** 
## SavingsAccountBonds.100.to.500     -7.571e-01  3.877e-01  -1.953  0.05083 .  
## SavingsAccountBonds.500.to.1000    -3.102e-01  5.274e-01  -0.588  0.55639    
## SavingsAccountBonds.gt.1000        -2.349e-01  5.947e-01  -0.395  0.69284    
## EmploymentDuration.lt.1             2.255e-01  4.925e-01   0.458  0.64711    
## EmploymentDuration.1.to.4           2.978e-01  4.682e-01   0.636  0.52473    
## EmploymentDuration.4.to.7           8.561e-01  5.057e-01   1.693  0.09045 .  
## EmploymentDuration.gt.7             3.178e-01  4.724e-01   0.673  0.50108    
## Personal.Male.Divorced.Seperated   -5.419e-01  4.982e-01  -1.088  0.27668    
## Personal.Female.NotSingle          -2.182e-01  3.492e-01  -0.625  0.53197    
## Personal.Male.Single                2.917e-01  3.523e-01   0.828  0.40770    
## OtherDebtorsGuarantors.None        -7.453e-01  4.707e-01  -1.583  0.11339    
## OtherDebtorsGuarantors.CoApplicant -1.243e+00  6.380e-01  -1.948  0.05138 .  
## Property.RealEstate                 8.035e-01  4.647e-01   1.729  0.08381 .  
## Property.Insurance                  6.041e-01  4.511e-01   1.339  0.18050    
## Property.CarOther                   4.111e-01  4.378e-01   0.939  0.34776    
## OtherInstallmentPlans.Bank         -5.736e-01  2.706e-01  -2.120  0.03401 *  
## OtherInstallmentPlans.Stores       -4.597e-01  4.649e-01  -0.989  0.32276    
## Housing.Rent                       -5.839e-01  5.256e-01  -1.111  0.26656    
## Housing.Own                        -7.262e-02  4.909e-01  -0.148  0.88240    
## Job.UnemployedUnskilled             9.950e-01  8.532e-01   1.166  0.24352    
## Job.UnskilledResident               1.006e-01  3.978e-01   0.253  0.80027    
## Job.SkilledEmployee                 1.195e-02  3.242e-01   0.037  0.97060    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 980.75  on 799  degrees of freedom
## Residual deviance: 717.58  on 751  degrees of freedom
## AIC: 815.58
## 
## Number of Fisher Scoring iterations: 5

Your observation: Out of all our variables, only 14 are significant at the 0.05 level.

Task 3: Optimal Probability Cut-off, with weight0 = 1 and weight1 = ### 1.

1. Use the training set to predict probabilities.

pred.glm0.train <- predict(credit.glm0,type="response")
hist(pred.glm0.train)

Your observation: This histogram shows a left-skewed distribution, meaning that probability increases as you move toward 1.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

# define a cost function with input "obs" being observed response 
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
    weight_FN = 1   # define the weight for "true=1 but pred=0" (False Neg)
    weight_FP = 1    # define the weight for "true=0 but pred=1" (False Pos)
    # increase weight for more important category - FN or FP
    c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (False Neg)
    c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (False Pos)
    cost = mean(weight_FN*c1 + weight_FP*c0)  # misclassification with weight
    return(cost) # you have to return to a value when you write R functions
} # end of the function

# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01) 

mean_error = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
    mean_error[i] = costfunc(obs = credit.train$Class, pred.p = pred.glm0.train, pcut = p.seq[i])  
} # end of the loop

# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, mean_error)

MR<- mean(credit.train$Class!=credit.train)
print(MR)
## [1] 0.6246939

Your observation: In this model, the optimal probability cut-off point is around 0.42, and the misclassification rate is about 62%.

Task 4: Model Evaluation

1. Using the optimal probability cut-off point obtained in ### 3.2, generate confusion matrix and obtain MR for the the training set.

# get binary prediction
class.train<- (pred.glm0.train> 0.42)*1
# get confusion matrix
table(credit.train$Class, class.train, dnn = c("True", "Predicted"))
##        Predicted
## True      0   1
##   FALSE 112 130
##   TRUE   37 521
# (equal-weighted) misclassification rate
MR_train<- mean(credit.train$Class!=class.train)
MR_train
## [1] 0.20875

Your observation: When we set our optimal probability cut-off point at 0.42, our misclassification rate falls to 29%.

2. Using the optimal probability cut-off point obtained in ### 3.2, generate the ROC curve and calculate the AUC for the training set.

#install.packages('ROCR')
library(ROCR)
pred <- prediction(pred.glm0.train, credit.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.835829

Your observation: In our training set, the AUC is 0.84. If an AUC of 0.50 is random chance and an AUC of 1.0 is perfect prediction, we know that an AUC of 0.84 is pretty good!

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

pred.glm0.test <- predict(credit.glm0, newdata = credit.test, type="response")
hist(pred.glm0.test)

# get binary prediction
class.test <- (pred.glm0.test> 0.42)*1
# get confusion matrix
table(credit.test$Class, class.test, dnn = c("True", "Predicted"))
##        Predicted
## True      0   1
##   FALSE  25  33
##   TRUE   14 128
# (equal-weighted) misclassification rate
MR_test<- mean(credit.test$Class!=class.test)
MR_test
## [1] 0.235

Your observation: Our histogram is still left-skewed, but the y-axis frequency has fallen from 200 in the training set to 70 in the test set. Our misclassification rate is down to 23.5%.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

pred_test <- prediction(pred.glm0.test, credit.test$Class)
perf_test <- performance(pred, "tpr", "fpr")
plot(perf_test, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred_test, "auc"), "y.values"))
## [1] 0.8091306

Your observation: With our test set, our AUC is 0.81 indicating that it doesn’t perform quite as well as our training set.

Task 5: Using different weights

Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be ### 1. Then define your cost function accordingly!

1. Obtain optimal probability cut-off point again, with the new weights.

# define a cost function with input "obs" being observed response 
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
    weight_FN = 1   # define the weight for "true=1 but pred=0" (False Neg)
    weight_FP = 5    # define the weight for "true=0 but pred=1" (False Pos)
    # increase weight for more important category - FN or FP
    c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (False Neg)
    c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (False Pos)
    cost = mean(weight_FN*c1 + weight_FP*c0)  # misclassification with weight
    return(cost) # you have to return to a value when you write R functions
} # end of the function

# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01) 

mean_error = rep(0, length(p.seq))  
for(i in 1:length(p.seq)){ 
    mean_error[i] = costfunc(obs = credit.train$Class, pred.p = pred.glm0.train, pcut = p.seq[i])  
} # end of the loop

# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, mean_error)

Your observation: When we adjust our weights to give more importance on False Positives, our optimal cut-off point changes to somewhere around 0.79.

2. Obtain the confusion matrix and MR for the training set.

# get binary prediction
class.train_weight5 <- (pred.glm0.train> 0.79)*1
# get confusion matrix
table(credit.train$Class, class.train, dnn = c("True", "Predicted"))
##        Predicted
## True      0   1
##   FALSE 112 130
##   TRUE   37 521
# (equal-weighted) misclassification rate
MR_train_weight5 <- mean(credit.train$Class!=class.train_weight5)
MR_train
## [1] 0.20875

Your observation: With the adjusted weight, our new misclassification rate increases to 29%, but we decreased our False Positives from 130 in our training set with equal weight down to 34 FP in after adjusting the weights.

3. Obtain the confusion matrix and MR for the test set.

# get binary prediction
class.test_weight5 <- (pred.glm0.test> 0.79)*1
# get confusion matrix
table(credit.test$Class, class.test_weight5, dnn = c("True", "Predicted"))
##        Predicted
## True     0  1
##   FALSE 49  9
##   TRUE  51 91
# (equal-weighted) misclassification rate
MR_test <- mean(credit.test$Class!=class.test_weight5)
MR_test
## [1] 0.3

Your observation: Our testing dataset shows even few False Positives, now down to 9 observations. Our misclassifcation rate has stayed fairly steady, only increasing slightly from 29% to 30%.

Task 6: Report

Summarize your findings, including the optimal probability cut-off, MR and AUC (if calculated) for both in-sample and out-of-sample data. Discuss what you observed and make some suggestions on how can we improve the model.

Summary: Our in-sample MR is 0.29 and the AUC is 0.835. Our out-of-sample MR is 0.235 and the AUC is 0.809. When we adjust our weights, our in-smaple MR stays at 0.29 and the out-of-sample MR is 0.30.

If our goal is to give more importance to reducing False Positives, our best model is when we adjust the weights and run it on our testing dataset. To imporve the model, we could look at fine tuning the cut-off points (for example, run it again with a cut off of 0.795, 0.78, or 0.80 to see if we can get better results.) We could also try removing the non-signifcant variables to make the model more parsimonious while still providing accurate probability prediction.