Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for variable descriptions. The response variable is Class; all other variables are predictors.

Run the following code only once to install the caret package. The German credit scoring data is provided in that package.

# install caret only if it is not already available
if (!requireNamespace("caret", quietly = TRUE)) {
    install.packages("caret")
}

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) # this package contains the German credit data in numeric format
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # convert `Class` to TRUE/FALSE (equivalent to 1/0)
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: There are 1000 observations and 62 variables. The data contains information that classifies people as either good or bad credit risks.

# Optional: drop variables that provide no information in the data
GermanCredit = GermanCredit[, -c(14,19,27,30,35,40,44,45,48,52,55,58,62)] # don't run this line twice! Think about why.
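
One reason the index-based drop above should not be run twice is that negative indices refer to positions in the current data frame, so a second run would remove a different set of columns. A minimal sketch of a name-based alternative follows; it uses a small hypothetical data frame, and `toy` and `drop_cols` are illustrative names that are not part of the assignment.

```r
# Hypothetical toy data frame standing in for GermanCredit (illustration only)
toy <- data.frame(
  Duration        = c(6L, 48L, 12L),
  Housing.Rent    = c(0, 0, 1),
  Housing.Own     = c(1, 1, 0),
  Housing.ForFree = c(0, 0, 0)
)

# Columns to drop, identified by name rather than by position
drop_cols <- c("Housing.ForFree")

# setdiff() keeps every column whose name is not in drop_cols
toy <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]

# Running the same line again changes nothing, because the name is already gone
toy <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]

names(toy)
```

For the real data, the names behind each dropped position can be read off the first str() output; for example, position 58 is Housing.ForFree.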

2. Explore the dataset to understand its structure.

(1) How many observations and variables are there? (2 pts)
str(GermanCredit)
## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...
head(GermanCredit)
##   Duration Amount InstallmentRatePercentage ResidenceDuration Age
## 1        6   1169                         4                 4  67
## 2       48   5951                         2                 2  22
## 3       12   2096                         2                 3  49
## 4       42   7882                         2                 4  45
## 5       24   4870                         3                 4  53
## 6       36   9055                         2                 4  35
##   NumberExistingCredits NumberPeopleMaintenance Telephone ForeignWorker Class
## 1                     2                       1         0             1  TRUE
## 2                     1                       1         1             1 FALSE
## 3                     1                       2         1             1  TRUE
## 4                     1                       2         1             1  TRUE
## 5                     2                       2         1             1 FALSE
## 6                     1                       2         0             1  TRUE
##   CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 1                          1                              0
## 2                          0                              1
## 3                          0                              0
## 4                          1                              0
## 5                          1                              0
## 6                          0                              0
##   CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 1                            0                              0
## 2                            0                              0
## 3                            0                              0
## 4                            0                              0
## 5                            0                              0
## 6                            0                              0
##   CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly CreditHistory.Delay
## 1                              0                      0                   0
## 2                              0                      1                   0
## 3                              0                      0                   0
## 4                              0                      1                   0
## 5                              0                      0                   1
## 6                              0                      1                   0
##   Purpose.NewCar Purpose.UsedCar Purpose.Furniture.Equipment
## 1              0               0                           0
## 2              0               0                           0
## 3              0               0                           0
## 4              0               0                           1
## 5              1               0                           0
## 6              0               0                           0
##   Purpose.Radio.Television Purpose.DomesticAppliance Purpose.Repairs
## 1                        1                         0               0
## 2                        1                         0               0
## 3                        0                         0               0
## 4                        0                         0               0
## 5                        0                         0               0
## 6                        0                         0               0
##   Purpose.Education Purpose.Retraining Purpose.Business
## 1                 0                  0                0
## 2                 0                  0                0
## 3                 1                  0                0
## 4                 0                  0                0
## 5                 0                  0                0
## 6                 1                  0                0
##   SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 1                          0                              0
## 2                          1                              0
## 3                          1                              0
## 4                          1                              0
## 5                          1                              0
## 6                          0                              0
##   SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 1                               0                           0
## 2                               0                           0
## 3                               0                           0
## 4                               0                           0
## 5                               0                           0
## 6                               0                           0
##   EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## 1                       0                         0                         0
## 2                       0                         1                         0
## 3                       0                         0                         1
## 4                       0                         0                         1
## 5                       0                         1                         0
## 6                       0                         1                         0
##   EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## 1                       1                                0
## 2                       0                                0
## 3                       0                                0
## 4                       0                                0
## 5                       0                                0
## 6                       0                                0
##   Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## 1                         0                    1                           1
## 2                         1                    0                           1
## 3                         0                    1                           1
## 4                         0                    1                           0
## 5                         0                    1                           1
## 6                         0                    1                           1
##   OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## 1                                  0                   1                  0
## 2                                  0                   1                  0
## 3                                  0                   1                  0
## 4                                  0                   0                  1
## 5                                  0                   0                  0
## 6                                  0                   0                  0
##   Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 1                 0                          0                            0
## 2                 0                          0                            0
## 3                 0                          0                            0
## 4                 0                          0                            0
## 5                 0                          0                            0
## 6                 0                          0                            0
##   Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## 1            0           1                       0                     0
## 2            0           1                       0                     0
## 3            0           1                       0                     1
## 4            0           0                       0                     0
## 5            0           0                       0                     0
## 6            0           0                       0                     1
##   Job.SkilledEmployee
## 1                   1
## 2                   1
## 3                   0
## 4                   1
## 5                   1
## 6                   0
summary(GermanCredit)
##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker     Class         CheckingAccountStatus.lt.0
##  Min.   :0.000   Mode :logical   Min.   :0.000             
##  1st Qu.:1.000   FALSE:300       1st Qu.:0.000             
##  Median :1.000   TRUE :700       Median :0.000             
##  Mean   :0.963                   Mean   :0.274             
##  3rd Qu.:1.000                   3rd Qu.:1.000             
##  Max.   :1.000                   Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
##  Min.   :0.00                   Min.   :0.000                 
##  1st Qu.:0.00                   1st Qu.:0.000                 
##  Median :0.00                   Median :0.000                 
##  Mean   :0.04                   Mean   :0.049                 
##  3rd Qu.:0.00                   3rd Qu.:0.000                 
##  Max.   :1.00                   Max.   :1.000                 
##  CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.00           Min.   :0.000       Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00           1st Qu.:0.000       1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.00           Median :0.000       Median :0.000   Median :0.000  
##  Mean   :0.53           Mean   :0.088       Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.00           3rd Qu.:0.000       3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.00           Max.   :1.000       Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
##  Min.   :0.000   Min.   :0.00      Min.   :0.000      Min.   :0.000   
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0.000      1st Qu.:0.000   
##  Median :0.000   Median :0.00      Median :0.000      Median :0.000   
##  Mean   :0.022   Mean   :0.05      Mean   :0.009      Mean   :0.097   
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0.000      3rd Qu.:0.000   
##  Max.   :1.000   Max.   :1.00      Max.   :1.000      Max.   :1.000   
##  SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
##  Min.   :0.000              Min.   :0.000                 
##  1st Qu.:0.000              1st Qu.:0.000                 
##  Median :1.000              Median :0.000                 
##  Mean   :0.603              Mean   :0.103                 
##  3rd Qu.:1.000              3rd Qu.:0.000                 
##  Max.   :1.000              Max.   :1.000                 
##  SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
##  Min.   :0.000                   Min.   :0.000              
##  1st Qu.:0.000                   1st Qu.:0.000              
##  Median :0.000                   Median :0.000              
##  Mean   :0.063                   Mean   :0.048              
##  3rd Qu.:0.000                   3rd Qu.:0.000              
##  Max.   :1.000                   Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
##  Min.   :0.000           Min.   :0.00                    
##  1st Qu.:0.000           1st Qu.:0.00                    
##  Median :0.000           Median :0.00                    
##  Mean   :0.253           Mean   :0.05                    
##  3rd Qu.:1.000           3rd Qu.:0.00                    
##  Max.   :1.000           Max.   :1.00                    
##  Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
##  Min.   :0.00              Min.   :0.000        Min.   :0.000              
##  1st Qu.:0.00              1st Qu.:0.000        1st Qu.:1.000              
##  Median :0.00              Median :1.000        Median :1.000              
##  Mean   :0.31              Mean   :0.548        Mean   :0.907              
##  3rd Qu.:1.00              3rd Qu.:1.000        3rd Qu.:1.000              
##  Max.   :1.00              Max.   :1.000        Max.   :1.000              
##  OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
##  Min.   :0.000                      Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                      1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                      Median :0.000       Median :0.000     
##  Mean   :0.041                      Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                      3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                      Max.   :1.000       Max.   :1.000     
##  Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
##  Min.   :0.000     Min.   :0.000              Min.   :0.000               
##  1st Qu.:0.000     1st Qu.:0.000              1st Qu.:0.000               
##  Median :0.000     Median :0.000              Median :0.000               
##  Mean   :0.332     Mean   :0.139              Mean   :0.047               
##  3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:0.000               
##  Max.   :1.000     Max.   :1.000              Max.   :1.000               
##   Housing.Rent    Housing.Own    Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :0.000   Median :1.000   Median :0.000           Median :0.0          
##  Mean   :0.179   Mean   :0.713   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee
##  Min.   :0.00       
##  1st Qu.:0.00       
##  Median :1.00       
##  Mean   :0.63       
##  3rd Qu.:1.00       
##  Max.   :1.00
?GermanCredit

Your observation: With the above variables dropped, there are 1000 observations of 49 variables. There are 7 integer variables, 41 numeric variables, and 1 logical variable (Class). Overall, the data set includes predictors such as checking account status, employment duration, age, and other personal information, which may shape an individual’s creditworthiness.
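
The type counts in this observation can be tallied programmatically rather than by reading the str() output. A minimal sketch, using a hypothetical three-column data frame (`toy` is an illustrative name) with one column of each type that appears in the data:

```r
# Hypothetical toy data frame with one column of each type seen above
toy <- data.frame(
  Duration  = c(6L, 48L, 12L),       # integer
  Telephone = c(0, 1, 1),            # numeric (double)
  Class     = c(TRUE, FALSE, TRUE)   # logical
)

# class() of each column, then a frequency table of those classes
type_counts <- table(vapply(toy, function(col) class(col)[1], character(1)))
type_counts
```

Applied to GermanCredit after the drop, the same call should reproduce the 7 / 41 / 1 breakdown reported above.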

(2) Please make a frequency table of the variable Class (use the table() function). How many observations are classed as “good” and how many as “bad”? (2 pts)
table(GermanCredit$Class)
## 
## FALSE  TRUE 
##   300   700

Your observation: Since TRUE corresponds with “good” and FALSE corresponds with “bad”, there are 700 observations that are good and 300 that are bad.
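
The counts can also be expressed as proportions, which makes the class imbalance explicit. A short sketch, assuming a logical vector built to match the counts in the table above (the standalone `Class` vector here is a stand-in for GermanCredit$Class):

```r
# Logical vector with the same class counts as the table above (700 TRUE, 300 FALSE)
Class <- c(rep(TRUE, 700), rep(FALSE, 300))

counts <- table(Class)
props  <- prop.table(counts)   # divides each count by the total

props[["TRUE"]]    # 0.7, share of good credit risks
props[["FALSE"]]   # 0.3, share of bad credit risks
```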

(3) Please make a barplot of the response variable Class. Please add a title and axis labels. (2 pts)
library(ggplot2)

ggplot(GermanCredit, aes(x = factor(Class, labels = c("Bad", "Good")))) +
  geom_bar() +
  labs(title = "Frequency of Credit Classes (Bad vs. Good)",
       x = "Credit Class",
       y = "Count")

3. Split the dataset into training and test sets. A random seed of 2025 is set for reproducibility. Please comment on the split proportion you chose for the training and test data. (2 pts)

set.seed(2025) # set random seed for reproducibility.

# split dataset into 80% training and 20% testing
train_ind <- sample(1:nrow(GermanCredit), 0.8 * nrow(GermanCredit))

# create training and testing datasets
GermanCredit_train <- GermanCredit[train_ind, ]
GermanCredit_test <- GermanCredit[-train_ind, ]

# check dimensions of each set
dim(GermanCredit_train)
## [1] 800  49
dim(GermanCredit_test)
## [1] 200  49
# how many "good" and "bad" risks in data:
table(GermanCredit_train$Class)
## 
## FALSE  TRUE 
##   246   554
table(GermanCredit_test$Class)
## 
## FALSE  TRUE 
##    54   146

Your comment: I chose to randomly select 80% of the rows for the training data and used the remaining 20% for the testing data. Setting the seed ensures that anyone running the code gets the same split each time. In the training data, there are 246 bad risks and 554 good risks. In the testing data, there are 54 bad risks and 146 good risks.
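
The simple random split leaves slightly different class proportions in the two sets (554/800 is about 69% good in training, versus 146/200 = 73% good in testing). One possible alternative, which is not the method used in this assignment, is a stratified split that samples 80% within each class. The sketch below assumes a logical vector with the same class counts as the full data; `idx_by_class` is an illustrative name.

```r
set.seed(2025)

# Logical vector with the same class counts as the full data (700 good, 300 bad)
Class <- c(rep(TRUE, 700), rep(FALSE, 300))

# Row indices grouped by class: a list with elements "FALSE" and "TRUE"
idx_by_class <- split(seq_along(Class), Class)

# Sample 80% of the indices within each class, then combine
train_ind <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}), use.names = FALSE)

train_counts <- table(Class[train_ind])
test_counts  <- table(Class[-train_ind])

train_counts   # 240 bad, 560 good
test_counts    # 60 bad, 140 good
```

Both sets then keep the 70/30 ratio of the full data. The caret package also offers createDataPartition() for stratified sampling on the outcome.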

Task 2: Model Fitting

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right. (2 pts)

# fitting logistic model
# note: I used Class as response variable since this indicates credit risk (good or bad)

model_GermanCredit <- glm(Class ~ ., data = GermanCredit_train, family = "binomial")
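
On the variable types: glm() with family = "binomial" accepts a logical response and models the probability of TRUE, so the converted Class needs no further recoding. The sketch below checks this on simulated data; `x`, `y`, and the coefficients used to generate them are made up for illustration and have nothing to do with the credit data.

```r
set.seed(1)

# Simulated data (illustration only): one predictor and a logical response
x <- rnorm(200)
y <- runif(200) < plogis(1 + 2 * x)   # TRUE with probability plogis(1 + 2x)

# Fit once with the logical response and once with its 0/1 integer version
fit_lgl <- glm(y ~ x, family = "binomial")
fit_int <- glm(as.integer(y) ~ x, family = "binomial")

# Both fits give the same coefficients, so TRUE is treated as 1 ("success")
same_coefs <- isTRUE(all.equal(unname(coef(fit_lgl)), unname(coef(fit_int))))
same_coefs
```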

2. Summarize the model and interpret the coefficients. What is the estimated coefficient for the variable InstallmentRatePercentage? Is it significant, and why? (2 pts)

# examining fitted model:
summary(model_GermanCredit)
## 
## Call:
## glm(formula = Class ~ ., family = "binomial", data = GermanCredit_train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         8.815e+00  1.594e+00   5.529 3.22e-08 ***
## Duration                           -2.643e-02  1.027e-02  -2.573 0.010068 *  
## Amount                             -1.355e-04  4.957e-05  -2.734 0.006253 ** 
## InstallmentRatePercentage          -2.925e-01  9.747e-02  -3.001 0.002693 ** 
## ResidenceDuration                  -4.893e-02  9.689e-02  -0.505 0.613531    
## Age                                 1.355e-02  1.020e-02   1.329 0.183763    
## NumberExistingCredits              -3.599e-01  2.175e-01  -1.654 0.098027 .  
## NumberPeopleMaintenance            -3.644e-01  2.712e-01  -1.344 0.178951    
## Telephone                          -2.654e-01  2.276e-01  -1.166 0.243564    
## ForeignWorker                      -1.454e+00  7.181e-01  -2.025 0.042862 *  
## CheckingAccountStatus.lt.0         -1.775e+00  2.635e-01  -6.735 1.64e-11 ***
## CheckingAccountStatus.0.to.200     -1.393e+00  2.596e-01  -5.365 8.09e-08 ***
## CheckingAccountStatus.gt.200       -9.145e-01  4.225e-01  -2.165 0.030425 *  
## CreditHistory.NoCredit.AllPaid     -1.232e+00  4.874e-01  -2.528 0.011465 *  
## CreditHistory.ThisBank.AllPaid     -1.570e+00  4.829e-01  -3.252 0.001147 ** 
## CreditHistory.PaidDuly             -9.374e-01  2.907e-01  -3.224 0.001262 ** 
## CreditHistory.Delay                -5.626e-01  3.728e-01  -1.509 0.131283    
## Purpose.NewCar                     -1.417e+00  8.247e-01  -1.718 0.085814 .  
## Purpose.UsedCar                     2.976e-01  8.684e-01   0.343 0.731867    
## Purpose.Furniture.Equipment        -5.842e-01  8.330e-01  -0.701 0.483148    
## Purpose.Radio.Television           -4.791e-01  8.311e-01  -0.576 0.564284    
## Purpose.DomesticAppliance          -7.999e-01  1.209e+00  -0.661 0.508314    
## Purpose.Repairs                    -1.601e+00  1.011e+00  -1.585 0.113077    
## Purpose.Education                  -1.304e+00  9.075e-01  -1.436 0.150873    
## Purpose.Retraining                  3.610e-01  1.500e+00   0.241 0.809867    
## Purpose.Business                   -6.988e-01  8.558e-01  -0.817 0.414159    
## SavingsAccountBonds.lt.100         -1.065e+00  3.092e-01  -3.444 0.000573 ***
## SavingsAccountBonds.100.to.500     -8.837e-01  3.957e-01  -2.233 0.025519 *  
## SavingsAccountBonds.500.to.1000    -9.486e-01  5.005e-01  -1.896 0.058026 .  
## SavingsAccountBonds.gt.1000         1.460e-01  5.940e-01   0.246 0.805829    
## EmploymentDuration.lt.1             9.279e-02  4.634e-01   0.200 0.841281    
## EmploymentDuration.1.to.4           2.090e-01  4.443e-01   0.470 0.638077    
## EmploymentDuration.4.to.7           9.869e-01  4.882e-01   2.021 0.043232 *  
## EmploymentDuration.gt.7             1.785e-01  4.568e-01   0.391 0.695924    
## Personal.Male.Divorced.Seperated   -1.288e-01  5.011e-01  -0.257 0.797124    
## Personal.Female.NotSingle          -2.033e-01  3.478e-01  -0.584 0.558947    
## Personal.Male.Single                3.802e-01  3.481e-01   1.092 0.274710    
## OtherDebtorsGuarantors.None        -7.408e-01  4.667e-01  -1.587 0.112448    
## OtherDebtorsGuarantors.CoApplicant -1.220e+00  6.317e-01  -1.931 0.053423 .  
## Property.RealEstate                 3.867e-01  4.875e-01   0.793 0.427607    
## Property.Insurance                  1.635e-01  4.784e-01   0.342 0.732464    
## Property.CarOther                   1.796e-01  4.596e-01   0.391 0.695959    
## OtherInstallmentPlans.Bank         -7.233e-01  2.617e-01  -2.764 0.005716 ** 
## OtherInstallmentPlans.Stores       -1.929e-01  4.246e-01  -0.454 0.649608    
## Housing.Rent                       -4.468e-01  5.415e-01  -0.825 0.409294    
## Housing.Own                        -2.017e-01  5.184e-01  -0.389 0.697127    
## Job.UnemployedUnskilled             5.206e-01  7.212e-01   0.722 0.470322    
## Job.UnskilledResident              -2.015e-02  3.833e-01  -0.053 0.958079    
## Job.SkilledEmployee                -6.640e-02  3.153e-01  -0.211 0.833214    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 987.34  on 799  degrees of freedom
## Residual deviance: 728.49  on 751  degrees of freedom
## AIC: 826.49
## 
## Number of Fisher Scoring iterations: 5

Your comment: Some predictors in the model are not statistically significant because their p-values are greater than 0.05 (e.g., residence duration, telephone, housing). Here are my interpretations of the first several variables in the model:
- Duration: As duration increases, class is more likely to be good.
- Amount: As loan amount decreases, class is less likely to be good.
- Installment rate: The estimate is -0.2925, indicating that as the installment rate increases by 1, the log-odds of having good credit worthiness decrease by 0.2925 on average, provided that other variables are held constant. In other words, a higher installment rate reduces the odds of being identified as a good credit risk. The p-value for this variable is 0.002693; because this is lower than the threshold of 0.05, the installment rate is statistically significant.
In summary, having more income go towards installment payments is associated with a lower likelihood of being classified as a good credit risk.

3. Please interpret this number in detail (please calculate the corresponding odds ratio, and interpret it). (2 pts)

# Calculate odds ratio for a specific coefficient (example: InstallmentRatePercentage)
exp(-0.2925)
## [1] 0.7463952
# Show proportional change in the odds (odds ratio minus 1)
exp(-0.2925)-1
## [1] -0.2536048

Your comment: The odds ratio is 0.7464, which indicates that as the installment rate percentage increases by 1 unit, the odds of having good credit decrease by ~25% when all other variables are held constant.
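As a rough illustration of the calculation above, the sketch below wraps the conversion in a small helper. The function name `odds_ratio_summary` and its `units` argument are hypothetical and introduced only for this example; the coefficient -0.2925 is the InstallmentRatePercentage estimate quoted above.

```r
# Hypothetical helper (not part of the original analysis): converts a
# logistic-regression coefficient into an odds ratio and a percent change in odds.
odds_ratio_summary <- function(beta, units = 1) {
  or <- exp(beta * units)  # odds ratio for a `units`-unit increase in the predictor
  c(odds_ratio = or, pct_change_in_odds = 100 * (or - 1))
}

# InstallmentRatePercentage coefficient quoted in the comment above
odds_ratio_summary(-0.2925)

# A 3-unit increase: odds ratios multiply, so this equals exp(-0.2925)^3
odds_ratio_summary(-0.2925, units = 3)
```

Because the model is linear on the log-odds scale, the effect of a multi-unit increase is the one-unit odds ratio raised to that power, not the one-unit percent change multiplied by the number of units.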

Task 3: Model Evaluation (Part I)

1. Use the training set to obtain predicted probabilities. (2 pts)

# predicted probabilities on training set:
pred_prob_train <- predict(model_GermanCredit, newdata = GermanCredit_train, type = "response")

# summarize predicted probabilities
summary(pred_prob_train)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05173 0.50798 0.76807 0.69250 0.91225 0.99919

2. Using the probability cut-off of 0.5, generate confusion matrix and obtain MR (misclassification rate) for the training set. (3 pts)

actual_value_train <- GermanCredit_train$Class * 1 # * 1 converts TRUE/FALSE to numeric 1/0
pred_value_train <- 1 * (pred_prob_train > 0.5)
confusion_matrix_train <- table(actual_value_train, pred_value_train)

# false positives and false negatives
FP_train <- sum(actual_value_train == 0 & pred_value_train == 1)
FN_train <- sum(actual_value_train == 1 & pred_value_train == 0)

# misclassification rate - 2 ways
MR_train1 <- (FP_train + FN_train) / sum(confusion_matrix_train)
MR_train2 <- 1 - sum(diag(confusion_matrix_train))/sum(confusion_matrix_train)

# showing confusion matrix and misclassification rate
confusion_matrix_train
##                   pred_value_train
## actual_value_train   0   1
##                  0 134 112
##                  1  62 492
MR_train1
## [1] 0.2175
MR_train2
## [1] 0.2175
# accuracy
sum(diag(confusion_matrix_train)) / sum(confusion_matrix_train)
## [1] 0.7825

Your comment: Based on the confusion matrix, there are 134 true negatives (the model predicted bad credit and the individual actually had bad credit), 112 false positives (the model predicted good credit but the individual actually had bad credit), 62 false negatives (the model predicted bad credit but the individual actually had good credit), and 492 true positives (the model predicted good credit and the individual actually had good credit). I also computed the accuracy rate, which indicates that 78.25% of the training cases were predicted correctly. The misclassification rate is 0.2175, which is the number of false positives plus the number of false negatives, divided by the total number of observations. This indicates that 21.75% of the predictions made by the model on the training set are incorrect.
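The counts discussed above can also be summarized per class. The sketch below rebuilds the training confusion matrix by hand from the printed counts and derives sensitivity and specificity alongside accuracy and MR; the variable names (`cm`, `TP`, and so on) are chosen for this example only.

```r
# Rebuild the training confusion matrix from the counts printed above
# (rows = actual class, columns = predicted class; matrix() fills column-wise)
cm <- matrix(c(134, 62, 112, 492), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

TN <- cm["0", "0"]  # bad credit predicted as bad
FP <- cm["0", "1"]  # bad credit predicted as good
FN <- cm["1", "0"]  # good credit predicted as bad
TP <- cm["1", "1"]  # good credit predicted as good

accuracy    <- (TP + TN) / sum(cm)
mr          <- (FP + FN) / sum(cm)
sensitivity <- TP / (TP + FN)  # share of good credits correctly identified
specificity <- TN / (TN + FP)  # share of bad credits correctly identified

round(c(accuracy = accuracy, MR = mr,
        sensitivity = sensitivity, specificity = specificity), 4)
```

With these counts, sensitivity is 492/554 (about 0.89) while specificity is 134/246 (about 0.54), so at a 0.5 cut-off the model identifies good credit far more reliably than bad credit.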

3. Find the optimal probability cut-off point using the MR. Please draw a plot of MR vs. cut-off probability, and comment on optimal cut-off probability. (3 pts)

# note: used training data for this

# create a sequence of cut-off probabilities (go from 0 to 1 by increments of 0.01)
pcut_seq <- seq(from = 0, to = 1, by = 0.01)

# vector to store MR rate
MR_seq <- rep(0, length(pcut_seq))

for (i in seq_along(pcut_seq)) {
  pcut <- pcut_seq[i]
  pred_value <- 1 * (pred_prob_train > pcut) # converts predicted probabilities to binary predictions
  # use loop-local names so the 0.5 cut-off results computed earlier are not overwritten
  FP_cut <- sum(actual_value_train == 0 & pred_value == 1)  # false positives at this cut-off
  FN_cut <- sum(actual_value_train == 1 & pred_value == 0)  # false negatives at this cut-off
  MR_seq[i] <- (FP_cut + FN_cut) / length(pred_value) # misclassification rate
}
plot(MR_seq ~ pcut_seq, xlab = "Cut-off probability", ylab = "Misclassification rate (training)")

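Your comment: The misclassification rate first decreases and then increases as the cut-off probability rises from 0 to 1. At a cut-off of 0 every applicant is predicted good, so the MR equals the share of bad credits in the training set (246/800 = 0.3075); at a cut-off of 1 every applicant is predicted bad, so the MR equals the share of good credits (554/800 = 0.6925). The MR reaches its lowest point at a cut-off of around 0.5, which indicates that this cut-off gives the best classification performance on the training data.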
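Rather than reading the minimum off the plot, the optimal cut-off can be located programmatically. The following is a minimal sketch on simulated data (an assumption: `prob` and `actual` are stand-ins generated here, not the German credit predictions), using `which.min()` on a vector of misclassification rates.

```r
set.seed(1)

# Simulated stand-ins (assumption): NOT the German credit data
n      <- 500
prob   <- runif(n)                           # pretend predicted probabilities
actual <- rbinom(n, size = 1, prob = prob)   # pretend 0/1 outcomes

# Misclassification rate at each candidate cut-off
pcut_seq <- seq(from = 0, to = 1, by = 0.01)
mr_seq <- sapply(pcut_seq, function(pcut) mean((prob > pcut) != actual))

# which.min() returns the index of the first minimum if there are ties
best_cut <- pcut_seq[which.min(mr_seq)]
best_mr  <- min(mr_seq)
c(best_cutoff = best_cut, min_MR = best_mr)
```

Applied to the real analysis, the same two lines with `MR_seq` and `pcut_seq` would give the exact cut-off that minimizes the training MR. Since that cut-off is tuned on the training data, its MR is likely to be somewhat optimistic.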

4. Please generate the ROC curve and calculate the AUC for the training set. Please comment on this AUC. (2 pts)

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# generate ROC curve
roc_obj_train <- roc(actual_value_train, pred_prob_train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj_train)

# calculate AUC
auc_train <- auc(actual_value_train, pred_prob_train)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# view AUC
auc_train
## Area under the curve: 0.8312

Your comment: The area under the curve is 0.8312, which represents an aggregate measure of the model’s performance across all possible classification thresholds. In this case, it specifically indicates the model’s ability to differentiate between good and bad credit worthiness across different cut-off probabilities. An AUC of 1 indicates a perfect model, while an AUC of 0.5 indicates a model that is no better than random guessing. Since this model’s AUC of 0.8312 is much closer to 1 than to 0.5, it indicates fairly strong discrimination on the training data.
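One way to make the AUC concrete: it equals the probability that a randomly chosen good-credit case receives a higher predicted probability than a randomly chosen bad-credit case (ties counted as one half). The sketch below checks this in base R on simulated data; `score` and `label` are stand-ins generated here (an assumption), not the German credit predictions, and the code does not use pROC.

```r
set.seed(2)

# Simulated stand-ins (assumption): NOT the German credit data
n     <- 300
score <- runif(n)                 # pretend predicted probabilities
label <- rbinom(n, 1, score)      # pretend 0/1 outcomes

pos <- score[label == 1]          # scores of the "good" cases
neg <- score[label == 0]          # scores of the "bad" cases

# Pairwise definition: share of (good, bad) pairs ranked correctly, ties = 1/2
auc_pairwise <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# Equivalent rank-based (Mann-Whitney) formula
r <- rank(score)
auc_rank <- (sum(r[label == 1]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

c(pairwise = auc_pairwise, rank_based = auc_rank)
```

Under this reading, the training AUC of 0.8312 means that in roughly 83% of good/bad pairs from the training set, the model assigns the higher probability to the good-credit case.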

Task 4: Model Evaluation (Part II)

1. Use the testing set to obtain predicted probabilities. (2 pts)

pred_prob_test <- predict(model_GermanCredit, newdata = GermanCredit_test, type = "response")
summary(pred_prob_test)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1150  0.6116  0.7922  0.7398  0.9312  0.9956

2. Using the probability cut-off of 0.5, generate confusion matrix and obtain MR (misclassification rate) for the training set. (2 pts)

Note: I calculated the confusion matrix and MR for the testing set here, since I already generated these values for the training set above.

actual_value_test <- GermanCredit_test$Class * 1 # * 1 converts TRUE/FALSE to numeric 1/0
pred_value_test <- 1 * (pred_prob_test > 0.5)
confusion_matrix_test <- table(actual_value_test, pred_value_test)

# false positives and false negatives
FP_test <- sum(actual_value_test == 0 & pred_value_test == 1)
FN_test <- sum(actual_value_test == 1 & pred_value_test == 0)

# misclassification rate - 2 ways
MR_test1 <- (FP_test + FN_test) / sum(confusion_matrix_test)
MR_test2 <- 1 - sum(diag(confusion_matrix_test))/sum(confusion_matrix_test)

# showing confusion matrix and misclassification rate
confusion_matrix_test
##                  pred_value_test
## actual_value_test   0   1
##                 0  21  33
##                 1  12 134
MR_test1
## [1] 0.225
MR_test2
## [1] 0.225
# accuracy
sum(diag(confusion_matrix_test)) / sum(confusion_matrix_test)
## [1] 0.775

Your comment: Based on the confusion matrix, there are 21 true negatives (the model predicted bad credit and the individual actually had bad credit), 33 false positives (the model predicted good credit but the individual actually had bad credit), 12 false negatives (the model predicted bad credit but the individual actually had good credit), and 134 true positives (the model predicted good credit and the individual actually had good credit). The accuracy rate is (21 + 134) / 200 = 0.775, which indicates that 77.5% of the test cases were predicted correctly. The misclassification rate is 0.225, which is the number of false positives plus the number of false negatives, divided by the total number of observations. This means that 22.5% of the predictions made by the model on the test set are incorrect. The model's misclassification rate on the test set is slightly higher than on the training set (0.225 vs. 0.2175).

3. Please generate the ROC curve and calculate the AUC for the test set. Please comment on this AUC. (2 pts)

# generate ROC curve
roc_obj_test <- roc(actual_value_test, pred_prob_test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj_test)

# calculate AUC
auc_test <- auc(actual_value_test, pred_prob_test)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# view AUC
auc_test
## Area under the curve: 0.8182

Your comment: The AUC for the test set is 0.8182, which represents an aggregate measure of the model’s performance across all possible classification thresholds. In this case, it specifically indicates the model’s ability to differentiate between good and bad credit worthiness across different cut-off probabilities. An AUC of 1 indicates a perfect model, while an AUC of 0.5 indicates a model that is no better than random guessing. Since the test AUC of 0.8182 is much closer to 1 than to 0.5, the model still discriminates fairly well on unseen data, though slightly less well than on the training set (AUC of 0.8312).
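To put the training and test results side by side, the sketch below collects the MR and AUC values printed in the output above into a small table and computes the test-minus-train gaps. The object names `perf` and `gap` are chosen for this example only; the numbers are copied from the printed output, not recomputed from the model.

```r
# Values copied from the printed output above (not recomputed here)
perf <- data.frame(
  set = c("train", "test"),
  MR  = c(0.2175, 0.225),
  AUC = c(0.8312, 0.8182)
)
perf$accuracy <- 1 - perf$MR   # accuracy is the complement of MR

# Test-minus-train gaps: a small gap suggests limited overfitting
gap <- c(MR  = perf$MR[2]  - perf$MR[1],
         AUC = perf$AUC[2] - perf$AUC[1])

perf
gap
```

The test MR is 0.0075 higher and the test AUC is 0.013 lower than the training values, so out-of-sample performance is only slightly worse than in-sample performance.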