Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for variable descriptions. The response variable is Class; all other variables are predictors.

Run the following code only once to install the caret package. The German credit scoring data is provided in that package.

Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) # this package contains the German credit data in numeric format
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # convert `Class` into TRUE or FALSE (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

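Your observation: GermanCredit has 1000 observations and 62 variables. Seven are integer-valued (for example Duration, Amount, and Age), Class is the logical response, and the remaining 54 are 0/1 indicators, most of them dummy variables created from categorical variables.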

# Optional: drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] # don't run this code twice!! Think about why.
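
One way to see why the positional drop must not be run twice: the indices refer to column positions in the original data frame, and those positions shift after the first drop. The sketch below uses a small made-up data frame (`toy`, `drop_cols`, and `drop_by_name` are hypothetical names, not part of the assignment) to contrast dropping by position with dropping by name, which is safe to rerun.

```r
# Toy data frame standing in for GermanCredit (hypothetical columns and values)
toy <- data.frame(
  Duration        = c(6, 48, 12),
  Housing.ForFree = c(0, 0, 0),
  Housing.Rent    = c(0, 1, 0),
  Housing.Own     = c(1, 0, 1)
)

# Dropping by position: the second run removes a different column,
# because Housing.Rent has moved into position 2
by_pos_once  <- toy[, -2, drop = FALSE]
by_pos_twice <- by_pos_once[, -2, drop = FALSE]
names(by_pos_twice)

# Dropping by name: a second run finds nothing left to drop
drop_cols <- c("Housing.ForFree")
drop_by_name <- function(df, cols) {
  df[, setdiff(names(df), cols), drop = FALSE]
}
once  <- drop_by_name(toy, drop_cols)
twice <- drop_by_name(once, drop_cols)
identical(names(once), names(twice))
```

Applied to the real data, the same idea would mean listing the 13 column names instead of their positions.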

2. Explore the dataset to understand its structure. (10pts)

str(GermanCredit)

## 'data.frame':    1000 obs. of  49 variables:
##  $ Duration                          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                            : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage         : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                 : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits             : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance           : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                         : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                             : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0        : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200    : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.NoCredit.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly            : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay               : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Purpose.NewCar                    : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment       : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television          : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                 : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Retraining                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100        : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000       : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ EmploymentDuration.lt.1           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4         : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7         : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Personal.Male.Divorced.Seperated  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle         : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single              : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ OtherDebtorsGuarantors.None       : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Property.RealEstate               : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                 : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ OtherInstallmentPlans.Bank        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Housing.Rent                      : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                       : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Job.UnemployedUnskilled           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident             : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee               : num  1 1 0 1 1 0 1 0 0 0 ...

summary(GermanCredit)

##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker     Class         CheckingAccountStatus.lt.0
##  Min.   :0.000   Mode :logical   Min.   :0.000             
##  1st Qu.:1.000   FALSE:300       1st Qu.:0.000             
##  Median :1.000   TRUE :700       Median :0.000             
##  Mean   :0.963                   Mean   :0.274             
##  3rd Qu.:1.000                   3rd Qu.:1.000             
##  Max.   :1.000                   Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
##  Min.   :0.00                   Min.   :0.000                 
##  1st Qu.:0.00                   1st Qu.:0.000                 
##  Median :0.00                   Median :0.000                 
##  Mean   :0.04                   Mean   :0.049                 
##  3rd Qu.:0.00                   3rd Qu.:0.000                 
##  Max.   :1.00                   Max.   :1.000                 
##  CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.00           Min.   :0.000       Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00           1st Qu.:0.000       1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.00           Median :0.000       Median :0.000   Median :0.000  
##  Mean   :0.53           Mean   :0.088       Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.00           3rd Qu.:0.000       3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.00           Max.   :1.000       Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
##  Min.   :0.000   Min.   :0.00      Min.   :0.000      Min.   :0.000   
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0.000      1st Qu.:0.000   
##  Median :0.000   Median :0.00      Median :0.000      Median :0.000   
##  Mean   :0.022   Mean   :0.05      Mean   :0.009      Mean   :0.097   
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0.000      3rd Qu.:0.000   
##  Max.   :1.000   Max.   :1.00      Max.   :1.000      Max.   :1.000   
##  SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
##  Min.   :0.000              Min.   :0.000                 
##  1st Qu.:0.000              1st Qu.:0.000                 
##  Median :1.000              Median :0.000                 
##  Mean   :0.603              Mean   :0.103                 
##  3rd Qu.:1.000              3rd Qu.:0.000                 
##  Max.   :1.000              Max.   :1.000                 
##  SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
##  Min.   :0.000                   Min.   :0.000              
##  1st Qu.:0.000                   1st Qu.:0.000              
##  Median :0.000                   Median :0.000              
##  Mean   :0.063                   Mean   :0.048              
##  3rd Qu.:0.000                   3rd Qu.:0.000              
##  Max.   :1.000                   Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
##  Min.   :0.000           Min.   :0.00                    
##  1st Qu.:0.000           1st Qu.:0.00                    
##  Median :0.000           Median :0.00                    
##  Mean   :0.253           Mean   :0.05                    
##  3rd Qu.:1.000           3rd Qu.:0.00                    
##  Max.   :1.000           Max.   :1.00                    
##  Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
##  Min.   :0.00              Min.   :0.000        Min.   :0.000              
##  1st Qu.:0.00              1st Qu.:0.000        1st Qu.:1.000              
##  Median :0.00              Median :1.000        Median :1.000              
##  Mean   :0.31              Mean   :0.548        Mean   :0.907              
##  3rd Qu.:1.00              3rd Qu.:1.000        3rd Qu.:1.000              
##  Max.   :1.00              Max.   :1.000        Max.   :1.000              
##  OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
##  Min.   :0.000                      Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                      1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                      Median :0.000       Median :0.000     
##  Mean   :0.041                      Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                      3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                      Max.   :1.000       Max.   :1.000     
##  Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
##  Min.   :0.000     Min.   :0.000              Min.   :0.000               
##  1st Qu.:0.000     1st Qu.:0.000              1st Qu.:0.000               
##  Median :0.000     Median :0.000              Median :0.000               
##  Mean   :0.332     Mean   :0.139              Mean   :0.047               
##  3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:0.000               
##  Max.   :1.000     Max.   :1.000              Max.   :1.000               
##   Housing.Rent    Housing.Own    Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :0.000   Median :1.000   Median :0.000           Median :0.0          
##  Mean   :0.179   Mean   :0.713   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee
##  Min.   :0.00       
##  1st Qu.:0.00       
##  Median :1.00       
##  Mean   :0.63       
##  3rd Qu.:1.00       
##  Max.   :1.00

Your observation: After dropping the 13 variables that provide no information, the dataset still has 1000 observations and now has 49 variables. The summary also shows that Class is imbalanced: 700 TRUE (good) and 300 FALSE (bad).

3. Split the dataset into training and test sets. Please use `2024` as the random seed for reproducibility. (10pts)

set.seed(2024)

# Create index for training data (a 70% training share is a common choice)
train_index <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)

# Split the data
train_data <- GermanCredit[train_index, ]
test_data  <- GermanCredit[-train_index, ]

Your observation: The dataset was split into training and test sets using a 70/30 split, with a random seed of 2024 for reproducibility, giving 700 training and 300 test observations. The createDataPartition() function performs a stratified split on Class, so both sets keep roughly the same proportion of good and bad credits.
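
To illustrate what a stratified split means, here is a minimal base-R sketch that samples 70% of the rows within each class separately. It is not caret's implementation; `stratified_index` is a hypothetical helper, and the response `y` is synthetic, built to mimic the 700/300 class counts seen in the summary.

```r
set.seed(2024)

# Synthetic response with the same class counts as the summary (700 TRUE, 300 FALSE)
y <- rep(c(TRUE, FALSE), times = c(700, 300))

# Sample a share p of the row indices within each class separately
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(y, p = 0.7)

length(train_idx)     # number of training rows
mean(y[train_idx])    # share of TRUE in the training part
mean(y[-train_idx])   # share of TRUE in the test part
```

Because sampling happens per class, the share of TRUE is the same in both parts, which is the property a plain random split does not guarantee.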

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

# Make sure Class is a factor
train_data$Class <- as.factor(train_data$Class)

# Fit logistic regression model
log_model <- glm(Class ~ ., data = train_data, family = binomial)

summary(log_model)

## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = train_data)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         9.7755921  1.7594925   5.556 2.76e-08 ***
## Duration                           -0.0281752  0.0114559  -2.459 0.013915 *  
## Amount                             -0.0001968  0.0000580  -3.394 0.000690 ***
## InstallmentRatePercentage          -0.3458012  0.1122102  -3.082 0.002058 ** 
## ResidenceDuration                  -0.1477247  0.1099835  -1.343 0.179222    
## Age                                -0.0011930  0.0111092  -0.107 0.914479    
## NumberExistingCredits              -0.1741853  0.2247245  -0.775 0.438277    
## NumberPeopleMaintenance            -0.2953842  0.3033517  -0.974 0.330188    
## Telephone                          -0.8357009  0.2619015  -3.191 0.001418 ** 
## ForeignWorker                      -1.6606566  0.8122576  -2.044 0.040905 *  
## CheckingAccountStatus.lt.0         -2.0280291  0.2899845  -6.994 2.68e-12 ***
## CheckingAccountStatus.0.to.200     -1.4706478  0.2943908  -4.996 5.87e-07 ***
## CheckingAccountStatus.gt.200       -0.6052653  0.4876931  -1.241 0.214577    
## CreditHistory.NoCredit.AllPaid     -1.2639798  0.5155113  -2.452 0.014211 *  
## CreditHistory.ThisBank.AllPaid     -1.8780235  0.5706646  -3.291 0.000999 ***
## CreditHistory.PaidDuly             -0.8775997  0.3159046  -2.778 0.005469 ** 
## CreditHistory.Delay                -0.4012640  0.4307837  -0.931 0.351608    
## Purpose.NewCar                     -1.0626620  0.8142904  -1.305 0.191887    
## Purpose.UsedCar                     1.1942539  0.8839916   1.351 0.176702    
## Purpose.Furniture.Equipment        -0.1681192  0.8320966  -0.202 0.839883    
## Purpose.Radio.Television           -0.3031554  0.8286036  -0.366 0.714467    
## Purpose.DomesticAppliance          -0.7371787  1.2321421  -0.598 0.549646    
## Purpose.Repairs                    -0.8575710  0.9887784  -0.867 0.385776    
## Purpose.Education                  -0.6848705  0.9364025  -0.731 0.464544    
## Purpose.Retraining                 -0.1649183  1.5465838  -0.107 0.915079    
## Purpose.Business                   -0.3600823  0.8535288  -0.422 0.673116    
## SavingsAccountBonds.lt.100         -0.9786195  0.3127225  -3.129 0.001752 ** 
## SavingsAccountBonds.100.to.500     -0.9669534  0.4406228  -2.195 0.028198 *  
## SavingsAccountBonds.500.to.1000    -0.2529878  0.5442721  -0.465 0.642061    
## SavingsAccountBonds.gt.1000         0.2713176  0.6594268   0.411 0.680747    
## EmploymentDuration.lt.1            -0.4435735  0.5345880  -0.830 0.406681    
## EmploymentDuration.1.to.4          -0.4275141  0.5069023  -0.843 0.399013    
## EmploymentDuration.4.to.7           0.4416798  0.5618787   0.786 0.431822    
## EmploymentDuration.gt.7            -0.2520532  0.5037635  -0.500 0.616835    
## Personal.Male.Divorced.Seperated   -0.4301280  0.5538492  -0.777 0.437385    
## Personal.Female.NotSingle          -0.0179029  0.3950224  -0.045 0.963851    
## Personal.Male.Single                0.6299901  0.3971902   1.586 0.112713    
## OtherDebtorsGuarantors.None        -1.0309812  0.5142560  -2.005 0.044984 *  
## OtherDebtorsGuarantors.CoApplicant -1.0727811  0.7201303  -1.490 0.136302    
## Property.RealEstate                 1.2295999  0.5185315   2.371 0.017725 *  
## Property.Insurance                  0.8935212  0.5097800   1.753 0.079643 .  
## Property.CarOther                   1.1356001  0.5048681   2.249 0.024493 *  
## OtherInstallmentPlans.Bank         -0.6436463  0.3046547  -2.113 0.034626 *  
## OtherInstallmentPlans.Stores       -0.2405278  0.4731218  -0.508 0.611184    
## Housing.Rent                       -0.7041915  0.5817432  -1.210 0.226093    
## Housing.Own                        -0.5109041  0.5552490  -0.920 0.357502    
## Job.UnemployedUnskilled             0.4681174  0.8091298   0.579 0.562897    
## Job.UnskilledResident               0.3450109  0.4498926   0.767 0.443156    
## Job.SkilledEmployee                 0.1604813  0.3719210   0.431 0.666110    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 855.21  on 699  degrees of freedom
## Residual deviance: 595.77  on 651  degrees of freedom
## AIC: 693.77
## 
## Number of Fisher Scoring iterations: 5

Your observation: A logistic regression model was fit on the training data using all predictors. Class was converted to a factor with levels FALSE and TRUE, so glm() with family = binomial models the probability that Class is TRUE, i.e. a good credit.
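
The direction of the fitted model depends on how the response is coded. As a small check on simulated data (the variables `x`, `good`, and `toy` are made up for illustration), the sketch below fits the same model once with a logical response and once with the factor produced by as.factor(); with a factor response, glm() treats the first level as failure, so both fits model the probability of TRUE.

```r
set.seed(1)

# Simulated data: one predictor and a logical response
n <- 200
x <- rnorm(n)
good <- runif(n) < plogis(0.5 + 1.2 * x)
toy <- data.frame(x = x, good = good)

# Fit with the logical response (TRUE is treated as success)
fit_logical <- glm(good ~ x, data = toy, family = binomial)

# Fit with the factor response; the first level is treated as failure
toy$good_f <- as.factor(toy$good)
levels(toy$good_f)
fit_factor <- glm(good_f ~ x, data = toy, family = binomial)

# Both codings give the same coefficients
all.equal(coef(fit_logical), coef(fit_factor))
```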

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

summary(log_model)

## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = train_data)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         9.7755921  1.7594925   5.556 2.76e-08 ***
## Duration                           -0.0281752  0.0114559  -2.459 0.013915 *  
## Amount                             -0.0001968  0.0000580  -3.394 0.000690 ***
## InstallmentRatePercentage          -0.3458012  0.1122102  -3.082 0.002058 ** 
## ResidenceDuration                  -0.1477247  0.1099835  -1.343 0.179222    
## Age                                -0.0011930  0.0111092  -0.107 0.914479    
## NumberExistingCredits              -0.1741853  0.2247245  -0.775 0.438277    
## NumberPeopleMaintenance            -0.2953842  0.3033517  -0.974 0.330188    
## Telephone                          -0.8357009  0.2619015  -3.191 0.001418 ** 
## ForeignWorker                      -1.6606566  0.8122576  -2.044 0.040905 *  
## CheckingAccountStatus.lt.0         -2.0280291  0.2899845  -6.994 2.68e-12 ***
## CheckingAccountStatus.0.to.200     -1.4706478  0.2943908  -4.996 5.87e-07 ***
## CheckingAccountStatus.gt.200       -0.6052653  0.4876931  -1.241 0.214577    
## CreditHistory.NoCredit.AllPaid     -1.2639798  0.5155113  -2.452 0.014211 *  
## CreditHistory.ThisBank.AllPaid     -1.8780235  0.5706646  -3.291 0.000999 ***
## CreditHistory.PaidDuly             -0.8775997  0.3159046  -2.778 0.005469 ** 
## CreditHistory.Delay                -0.4012640  0.4307837  -0.931 0.351608    
## Purpose.NewCar                     -1.0626620  0.8142904  -1.305 0.191887    
## Purpose.UsedCar                     1.1942539  0.8839916   1.351 0.176702    
## Purpose.Furniture.Equipment        -0.1681192  0.8320966  -0.202 0.839883    
## Purpose.Radio.Television           -0.3031554  0.8286036  -0.366 0.714467    
## Purpose.DomesticAppliance          -0.7371787  1.2321421  -0.598 0.549646    
## Purpose.Repairs                    -0.8575710  0.9887784  -0.867 0.385776    
## Purpose.Education                  -0.6848705  0.9364025  -0.731 0.464544    
## Purpose.Retraining                 -0.1649183  1.5465838  -0.107 0.915079    
## Purpose.Business                   -0.3600823  0.8535288  -0.422 0.673116    
## SavingsAccountBonds.lt.100         -0.9786195  0.3127225  -3.129 0.001752 ** 
## SavingsAccountBonds.100.to.500     -0.9669534  0.4406228  -2.195 0.028198 *  
## SavingsAccountBonds.500.to.1000    -0.2529878  0.5442721  -0.465 0.642061    
## SavingsAccountBonds.gt.1000         0.2713176  0.6594268   0.411 0.680747    
## EmploymentDuration.lt.1            -0.4435735  0.5345880  -0.830 0.406681    
## EmploymentDuration.1.to.4          -0.4275141  0.5069023  -0.843 0.399013    
## EmploymentDuration.4.to.7           0.4416798  0.5618787   0.786 0.431822    
## EmploymentDuration.gt.7            -0.2520532  0.5037635  -0.500 0.616835    
## Personal.Male.Divorced.Seperated   -0.4301280  0.5538492  -0.777 0.437385    
## Personal.Female.NotSingle          -0.0179029  0.3950224  -0.045 0.963851    
## Personal.Male.Single                0.6299901  0.3971902   1.586 0.112713    
## OtherDebtorsGuarantors.None        -1.0309812  0.5142560  -2.005 0.044984 *  
## OtherDebtorsGuarantors.CoApplicant -1.0727811  0.7201303  -1.490 0.136302    
## Property.RealEstate                 1.2295999  0.5185315   2.371 0.017725 *  
## Property.Insurance                  0.8935212  0.5097800   1.753 0.079643 .  
## Property.CarOther                   1.1356001  0.5048681   2.249 0.024493 *  
## OtherInstallmentPlans.Bank         -0.6436463  0.3046547  -2.113 0.034626 *  
## OtherInstallmentPlans.Stores       -0.2405278  0.4731218  -0.508 0.611184    
## Housing.Rent                       -0.7041915  0.5817432  -1.210 0.226093    
## Housing.Own                        -0.5109041  0.5552490  -0.920 0.357502    
## Job.UnemployedUnskilled             0.4681174  0.8091298   0.579 0.562897    
## Job.UnskilledResident               0.3450109  0.4498926   0.767 0.443156    
## Job.SkilledEmployee                 0.1604813  0.3719210   0.431 0.666110    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 855.21  on 699  degrees of freedom
## Residual deviance: 595.77  on 651  degrees of freedom
## AIC: 693.77
## 
## Number of Fisher Scoring iterations: 5

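Your observation: The summary reports an estimate, standard error, and p-value for each coefficient. A small p-value indicates strong evidence that the coefficient differs from zero; it does not by itself mean the effect is large. Amount is significant (p = 0.00069) and has a negative coefficient (-0.0001968): holding the other predictors fixed, each additional unit of loan amount lowers the log-odds of being a good credit by 0.0001968, so larger loans are associated with higher risk.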
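
Coefficients on the log-odds scale are easier to read as odds ratios. The sketch below copies the Amount estimate and standard error from the summary above and exponentiates them. The 1000-unit step and the Wald-type 95% interval are choices made here for illustration, not part of the assignment.

```r
# Estimate and standard error for Amount, copied from the model summary
beta_amount <- -0.0001968
se_amount   <- 0.0000580

# Odds ratio for a one-unit and for a 1000-unit increase in Amount
or_per_unit <- exp(beta_amount)
or_per_1000 <- exp(1000 * beta_amount)

# Wald-type 95% interval for the 1000-unit odds ratio
z <- qnorm(0.975)
ci_per_1000 <- exp(1000 * (beta_amount + c(-1, 1) * z * se_amount))

round(or_per_1000, 3)
round(ci_per_1000, 3)
```

The odds ratio per 1000 units comes out at about 0.82, i.e. the odds of being a good credit are roughly 18% lower for each additional 1000 units borrowed, with the other predictors held fixed.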

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

# Get predicted probabilities for training data
train_probs <- predict(log_model, newdata = train_data, type = "response")
train_probs

##          1          2          3          4          5          7          8 
## 0.94848296 0.27796033 0.98221231 0.62636465 0.10671097 0.92238998 0.84618528 
##          9         10         13         14         15         16         18 
## 0.97464344 0.29520427 0.89639731 0.48124556 0.26183844 0.37328933 0.18355073 
##         19         20         23         26         27         28         29 
## 0.33817898 0.96535073 0.91592423 0.88704068 0.84251378 0.79908021 0.90161658 
##         30         31         32         34         35         36         37 
## 0.11509086 0.82665738 0.49668265 0.94312595 0.74340675 0.46396694 0.78401622 
##         38         39         40         41         42         43         44 
## 0.76316015 0.94528334 0.73722203 0.84356945 0.86475457 0.69292297 0.85923910 
##         48         49         50         51         52         54         55 
## 0.98501660 0.85538821 0.82703426 0.76140500 0.95987148 0.99641808 0.25491387 
##         56         57         59         62         63         65         66 
## 0.97350159 0.82771599 0.58803645 0.98516158 0.37131136 0.77367847 0.76290626 
##         67         68         70         71         72         73         75 
## 0.81940201 0.73052188 0.80972895 0.78221373 0.99313805 0.67454250 0.36950532 
##         77         78         80         81         82         84         86 
## 0.21962661 0.83854378 0.47954971 0.96050103 0.96501822 0.81606836 0.98331606 
##         87         88         90         91         93         94         95 
## 0.67777141 0.14093191 0.50279403 0.96279398 0.93298346 0.77511981 0.89050014 
##         96         98         99        100        101        102        103 
## 0.01516329 0.51328547 0.68826744 0.92134398 0.62280574 0.58369242 0.94350374 
##        104        105        106        107        110        111        112 
## 0.95934528 0.99294187 0.56799941 0.15997786 0.94907850 0.60755091 0.70162277 
##        113        115        116        117        119        120        121 
## 0.51948335 0.74046375 0.98063412 0.49912773 0.68104685 0.89444905 0.59170572 
##        123        124        125        126        127        129        130 
## 0.94156691 0.86381370 0.67749680 0.45282713 0.50625461 0.98230214 0.38530251 
##        132        133        134        136        138        139        140 
## 0.23073231 0.84914818 0.59185468 0.98834283 0.85150902 0.98448653 0.91929965 
##        144        153        154        159        162        163        165 
## 0.67016479 0.56875797 0.80840138 0.67435878 0.79629767 0.97458942 0.78519392 
##        166        168        169        170        171        174        177 
## 0.99237526 0.85522979 0.93042961 0.68128732 0.17803593 0.97587801 0.47978418 
##        178        179        180        181        183        185        186 
## 0.87121238 0.95471374 0.50914520 0.58679838 0.20234145 0.65211261 0.93476798 
##        187        189        190        191        192        195        197 
## 0.41059055 0.63695471 0.53325536 0.95712754 0.21663918 0.47002932 0.98933630 
##        198        200        202        203        204        205        206 
## 0.21891084 0.30483706 0.27715452 0.90839276 0.50155276 0.92420165 0.48160548 
##        207        208        209        210        211        213        214 
## 0.97421964 0.84956907 0.24139445 0.99951331 0.98952641 0.40067898 0.89013709 
##        217        218        219        220        221        222        223 
## 0.64898494 0.89401422 0.36646575 0.89342237 0.75640360 0.42770904 0.84441173 
##        226        228        231        233        234        236        237 
## 0.56279189 0.27219303 0.61041976 0.91418428 0.80686237 0.39188147 0.72743454 
##        239        240        241        243        245        246        247 
## 0.89165798 0.74474079 0.29160845 0.24148173 0.71523789 0.97945467 0.94736407 
##        248        249        250        251        252        255        256 
## 0.76533393 0.89637375 0.88064664 0.95824729 0.89989705 0.98826959 0.76736462 
##        257        259        260        263        264        265        266 
## 0.93486918 0.98449117 0.93256894 0.42518148 0.72486952 0.98492620 0.63769823 
##        267        269        270        271        272        275        276 
## 0.90285657 0.46919352 0.95282393 0.98548835 0.99277134 0.05176437 0.94987128 
##        277        278        279        280        281        282        283 
## 0.93487405 0.84005356 0.85624814 0.88508464 0.99329860 0.92054577 0.82718982 
##        285        286        287        288        289        290        291 
## 0.46587876 0.15444083 0.56361847 0.56002794 0.84413466 0.34379401 0.99291345 
##        293        295        296        297        299        300        301 
## 0.68736627 0.41589859 0.36772870 0.98213568 0.95211687 0.98818616 0.97240055 
##        302        304        305        306        307        308        311 
## 0.39979252 0.82820550 0.44945495 0.94454014 0.97609810 0.37337534 0.65317876 
##        312        313        315        316        318        319        321 
## 0.76700301 0.67812172 0.99155705 0.08943771 0.86455701 0.93910367 0.33707283 
##        324        326        327        329        330        331        332 
## 0.77229787 0.96934351 0.98729734 0.63802616 0.65240352 0.87074969 0.85789938 
##        334        336        337        338        340        341        342 
## 0.63977567 0.78297316 0.79511587 0.47539002 0.52472957 0.47186784 0.65897637 
##        343        344        345        347        348        349        350 
## 0.76101472 0.80609900 0.90949550 0.91737689 0.55978692 0.98108025 0.83501625 
##        351        352        353        354        355        356        357 
## 0.94182732 0.95953195 0.99575256 0.29909993 0.94702560 0.44304970 0.99677785 
##        359        360        364        366        367        368        369 
## 0.86623521 0.38121817 0.92369836 0.98645194 0.99448473 0.41780574 0.30421909 
##        370        371        372        374        376        379        380 
## 0.72418405 0.78290674 0.95035674 0.38468531 0.16643762 0.04850645 0.86089338 
##        381        382        383        384        385        387        388 
## 0.91513174 0.43267495 0.85013983 0.76102386 0.89111033 0.96178471 0.64896983 
##        390        392        394        395        396        398        399 
## 0.93663061 0.92068037 0.89566032 0.98179577 0.27530739 0.62640574 0.52284698 
##        400        401        402        403        405        407        408 
## 0.97427241 0.91824253 0.66723624 0.83624573 0.64764015 0.99767540 0.79360401 
##        409        411        412        414        415        418        419 
## 0.92029383 0.60784358 0.99024551 0.94360589 0.31856859 0.65234112 0.89777799 
##        420        421        423        424        426        427        428 
## 0.50879533 0.94868296 0.90159296 0.95634311 0.94147445 0.92010576 0.98270257 
##        430        431        432        433        434        436        438 
## 0.40888646 0.97922967 0.30572675 0.76254316 0.86131391 0.94859756 0.96478082 
##        439        440        441        442        443        445        446 
## 0.38183771 0.65973469 0.87745863 0.48866924 0.84334626 0.65707176 0.93478127 
##        447        448        449        450        451        452        453 
## 0.19776017 0.92609772 0.95688606 0.67870087 0.97659988 0.92839790 0.87108701 
##        456        457        459        461        462        463        464 
## 0.90123372 0.50490815 0.44404872 0.68896302 0.48957033 0.54736006 0.81853003 
##        466        468        472        473        474        475        476 
## 0.71365567 0.69600960 0.27536202 0.44074760 0.97935285 0.71375622 0.29743974 
##        480        483        485        486        488        490        492 
## 0.80107508 0.52930535 0.97783399 0.78187190 0.52010800 0.91802334 0.16333382 
##        493        494        495        496        498        499        500 
## 0.97051107 0.85660752 0.86273640 0.62018651 0.96121832 0.81924493 0.86617708 
##        501        503        504        505        506        507        509 
## 0.12557756 0.80322011 0.23608239 0.10007403 0.96280921 0.99757839 0.74431422 
##        515        516        517        518        522        523        526 
## 0.86442453 0.88114109 0.92808340 0.81073050 0.46862259 0.03872306 0.72560466 
##        528        529        530        531        532        533        534 
## 0.99234529 0.10405482 0.65602880 0.61520030 0.48810842 0.96062600 0.93461322 
##        535        537        538        539        541        543        544 
## 0.91678962 0.65824014 0.79892752 0.01288756 0.77381466 0.38219977 0.89744232 
##        546        547        548        551        552        553        558 
## 0.54737800 0.86061772 0.91150421 0.97108902 0.84042135 0.79257969 0.73495158 
##        559        563        564        565        567        568        569 
## 0.25034866 0.85207218 0.29583673 0.69491600 0.46073821 0.99097528 0.83683068 
##        570        572        573        574        576        577        578 
## 0.21658839 0.94866127 0.98276422 0.31754958 0.94513635 0.84094445 0.97509283 
##        579        581        583        584        585        586        587 
## 0.17198438 0.81841524 0.85300578 0.16350742 0.83342423 0.36681598 0.57289535 
##        588        590        593        594        595        596        597 
## 0.82340752 0.73466703 0.88536844 0.36596580 0.65765065 0.38746553 0.21613787 
##        598        601        604        606        607        610        611 
## 0.72477492 0.91640660 0.79959851 0.47381643 0.96115946 0.95204885 0.34469117 
##        612        616        620        622        623        624        625 
## 0.62857119 0.25056539 0.70363825 0.81544962 0.51802030 0.40029102 0.21576805 
##        626        628        629        630        631        632        637 
## 0.96189225 0.35797904 0.95708873 0.97434400 0.31400624 0.16495709 0.95445862 
##        638        640        641        642        644        647        649 
## 0.63065153 0.49937673 0.54005416 0.51704055 0.99060100 0.36487337 0.45347122 
##        650        652        653        654        655        656        657 
## 0.35400516 0.68451561 0.33916342 0.32179795 0.99595647 0.37351967 0.32341235 
##        658        659        661        662        663        664        666 
## 0.85997187 0.33673828 0.73561316 0.48668217 0.88757443 0.90423775 0.77995176 
##        667        669        670        671        672        673        674 
## 0.63831867 0.52146189 0.79305203 0.95433175 0.94901484 0.35265987 0.97267807 
##        675        676        677        678        680        684        685 
## 0.90451091 0.92025322 0.94918560 0.16416589 0.88367696 0.81515881 0.72407757 
##        686        687        688        689        690        692        693 
## 0.43827407 0.99252586 0.26420628 0.95781103 0.65683583 0.48027126 0.73336871 
##        694        695        696        697        698        699        700 
## 0.82197590 0.96994008 0.99043616 0.98523735 0.96653718 0.93943996 0.81122166 
##        701        702        704        706        708        710        711 
## 0.90405714 0.63151489 0.50514417 0.73632815 0.23735811 0.80499494 0.97084247 
##        712        713        714        715        716        717        718 
## 0.08939516 0.99275136 0.74637712 0.03303341 0.99014287 0.98483238 0.89447391 
##        720        723        724        725        727        729        730 
## 0.59470960 0.30216088 0.76465903 0.58128956 0.97400959 0.01428197 0.97111425 
##        732        734        737        739        742        743        744 
## 0.49848433 0.99218899 0.38751163 0.91066987 0.42172970 0.96323392 0.49076981 
##        746        747        748        749        750        752        754 
## 0.63868421 0.38337709 0.40513218 0.99041924 0.99281343 0.49301489 0.86319161 
##        755        756        760        763        764        765        767 
## 0.89808516 0.22253619 0.52846395 0.47314986 0.69684594 0.91496190 0.23214564 
##        768        769        771        772        775        776        777 
## 0.98894025 0.93075371 0.87435670 0.10229013 0.87795579 0.28923190 0.92604257 
##        778        779        780        781        784        785        786 
## 0.76253748 0.99360632 0.63772902 0.93975774 0.23407962 0.94804867 0.92399253 
##        788        789        790        791        792        793        795 
## 0.99130772 0.13829628 0.11558785 0.47875144 0.98274588 0.98867245 0.88147378 
##        796        798        799        800        801        802        803 
## 0.83741564 0.93315321 0.93422893 0.81185864 0.76041916 0.89374228 0.58253609 
##        805        808        809        811        813        814        815 
## 0.70364634 0.97435146 0.43790332 0.83135262 0.82812040 0.18231476 0.08420201 
##        816        817        818        820        821        822        824 
## 0.14695154 0.92339396 0.99349907 0.22618386 0.88626130 0.77315533 0.64046294 
##        825        827        828        829        830        832        833 
## 0.91824129 0.42681764 0.74962641 0.56623223 0.44105457 0.30950698 0.04893579 
##        834        835        836        837        838        839        840 
## 0.81286325 0.82793019 0.12268755 0.93828291 0.94377338 0.90390221 0.86255447 
##        841        842        843        844        845        846        847 
## 0.39991010 0.98011072 0.64370087 0.83118491 0.73356977 0.96783152 0.77852252 
##        848        849        850        852        854        855        857 
## 0.65955376 0.78941256 0.56774033 0.99775733 0.11662391 0.66146923 0.98532646 
##        858        859        860        861        862        863        864 
## 0.95425833 0.30913643 0.98929406 0.99026524 0.92995203 0.41886302 0.97994878 
##        865        867        868        869        870        872        873 
## 0.92055984 0.29607471 0.95194278 0.80818518 0.34006643 0.98456207 0.92593181 
##        874        875        876        877        878        879        881 
## 0.91889219 0.56288438 0.74844012 0.16345799 0.79387866 0.48884237 0.98032748 
##        882        883        884        885        887        890        891 
## 0.91621106 0.73821576 0.93230925 0.71290179 0.87392862 0.93795173 0.48126733 
##        892        893        895        896        897        898        900 
## 0.97096905 0.66762231 0.98640544 0.98814027 0.27742245 0.99847366 0.54058835 
##        901        902        904        905        906        909        910 
## 0.79557804 0.94259871 0.96339224 0.95478319 0.78458346 0.97934381 0.80580916 
##        911        912        913        915        916        917        918 
## 0.61053947 0.52627050 0.71017024 0.25122408 0.14150761 0.98727163 0.03812747 
##        919        921        922        923        924        928        929 
## 0.54426838 0.74745292 0.87174974 0.33239424 0.72206141 0.24462156 0.96579943 
##        930        931        933        935        936        940        941 
## 0.44327792 0.84285085 0.94928361 0.28824484 0.23367536 0.98968604 0.93860117 
##        942        943        946        947        948        950        951 
## 0.97167771 0.99014908 0.16126018 0.26765895 0.86679817 0.87118816 0.81936240 
##        955        957        958        959        961        962        963 
## 0.51917575 0.87693272 0.90988047 0.27688304 0.96166076 0.33831267 0.76861033 
##        964        966        967        968        969        970        971 
## 0.95272377 0.67310879 0.82418813 0.75116460 0.91004654 0.67519736 0.78465824 
##        972        974        975        976        978        979        980 
## 0.59270957 0.02151977 0.92769932 0.85712003 0.82919412 0.73268463 0.20333165 
##        981        983        984        985        986        988        989 
## 0.77464206 0.71982687 0.47173397 0.98461483 0.44355161 0.92342286 0.70474587 
##        990        991        993        994        996        999       1000 
## 0.76895996 0.97378937 0.80465761 0.47973885 0.94768552 0.30098042 0.89296843

Your observation: Predicted probabilities were successfully generated for all 700 observations in the training set. The values range between 0 and 1, representing the likelihood that each observation is classified as a good credit risk.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

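# Candidate cutoffs from 0.10 to 0.90 in steps of 0.01
cutoffs <- seq(0.1, 0.9, by = 0.01)

# Training misclassification rate at each cutoff
mr_values <- numeric(length(cutoffs))

for (i in seq_along(cutoffs)) {
  preds <- train_probs > cutoffs[i]

  # Share of observations whose predicted class differs from the actual class
  mr_values[i] <- mean(preds != train_data$Class)
}

# Cutoff with the lowest training MR (the first one if there are ties)
optimal_cutoff <- cutoffs[which.min(mr_values)]
optimal_cutoff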

## [1] 0.41

Your observation: The optimal probability cutoff was found by testing multiple threshold values and selecting the one that minimizes the misclassification rate. The cutoff value of 0.41 resulted in the lowest error and was chosen as the optimal threshold.
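
A caveat worth keeping in mind when MR is computed as `1 - sum(diag(cm)) / sum(cm)`: if a cutoff pushes every prediction into one class, `table()` returns a single-row table and `diag()` no longer picks out the correctly classified cells. The sketch below uses toy vectors invented for illustration (not the credit data) to show the effect and two ways around it.

```r
# Toy data, invented for illustration only
actual <- c(TRUE, FALSE, TRUE, TRUE)
preds  <- rep(TRUE, 4)            # a very low cutoff predicts TRUE for everyone

cm <- table(preds, actual)        # only one row ("TRUE"), so cm is 1 x 2
mr_diag <- 1 - sum(diag(cm)) / sum(cm)   # diag() picks cm[1, 1] only

# Direct comparison does not depend on the shape of the table
mr_direct <- mean(preds != actual)

# Fixing the factor levels keeps the table 2 x 2 even if a class is never predicted
lv <- c(FALSE, TRUE)
cm_full <- table(Predicted = factor(preds, levels = lv),
                 Actual    = factor(actual, levels = lv))
mr_full <- 1 - sum(diag(cm_full)) / sum(cm_full)

c(mr_diag = mr_diag, mr_direct = mr_direct, mr_full = mr_full)
```

In this toy case the `diag()` version on the collapsed table reports 0.75, while the other two give the true rate of 0.25.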

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the training set.

# Apply optimal cutoff
train_preds <- ifelse(train_probs > 0.41, TRUE, FALSE)

# Confusion matrix
cm_train <- table(Predicted = train_preds, Actual = train_data$Class)

# Misclassification rate
mr_train <- 1 - sum(diag(cm_train)) / sum(cm_train)

# Output results
cm_train

##          Actual
## Predicted FALSE TRUE
##     FALSE   103   27
##     TRUE    107  463

mr_train

## [1] 0.1914286

Your observation: The confusion matrix was generated using the optimal cutoff of 0.41. The misclassification rate (MR) for the training set is 0.191, meaning that approximately 19.1% of the observations were incorrectly classified.
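
The overall MR hides how the errors are split between the two classes. As a small worked example, the sketch below rebuilds the training confusion matrix from the counts printed above and derives sensitivity (share of good customers predicted good) and specificity (share of bad customers predicted bad). The variable names are illustrative.

```r
# Training confusion matrix as reported above (cutoff 0.41)
cm_train <- matrix(c(103, 107, 27, 463), nrow = 2,
                   dimnames = list(Predicted = c("FALSE", "TRUE"),
                                   Actual    = c("FALSE", "TRUE")))

mr          <- 1 - sum(diag(cm_train)) / sum(cm_train)                # 134 / 700
sensitivity <- cm_train["TRUE", "TRUE"]   / sum(cm_train[, "TRUE"])   # 463 / 490
specificity <- cm_train["FALSE", "FALSE"] / sum(cm_train[, "FALSE"])  # 103 / 210

round(c(MR = mr, sensitivity = sensitivity, specificity = specificity), 3)
```

At this cutoff about 94.5% of good customers are classified correctly, but only about 49% of bad customers are, so most of the training errors are bad customers predicted as good.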

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

# Load library
library(pROC)

# Generate ROC curve
roc_obj <- roc(train_data$Class, train_probs)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

# Plot ROC curve
plot(roc_obj, main = "ROC Curve - Training Set")

# Calculate AUC
auc_value <- auc(roc_obj)

# Output AUC
auc_value

## Area under the curve: 0.8497

Your observation: The ROC curve was generated for the training set to evaluate the model’s performance. The curve is well above the diagonal line, indicating strong classification ability. The AUC value is 0.8497, suggesting that the model performs well in distinguishing between good and bad credit risk.
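
Note that the ROC curve and AUC do not depend on the 0.41 cutoff: the curve sweeps over all thresholds. One way to read the AUC is as the probability that a randomly chosen good customer receives a higher predicted probability than a randomly chosen bad customer. The sketch below illustrates that reading in base R with toy scores invented for this example; it is not a replacement for `pROC`.

```r
# Toy scores, invented for illustration only
pos <- c(0.9, 0.8, 0.4)   # predicted probabilities for actual TRUE cases
neg <- c(0.7, 0.3)        # predicted probabilities for actual FALSE cases

# Compare every positive with every negative; ties count as half
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_pairwise <- mean(wins)
auc_pairwise
```

Five of the six positive-negative pairs are ordered correctly here, so the pairwise AUC is 5/6.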

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

# Get predicted probabilities for test set
test_probs <- predict(log_model, newdata = test_data, type = "response")

# Apply optimal cutoff (0.41)
test_preds <- ifelse(test_probs > 0.41, TRUE, FALSE)

# Confusion matrix
cm_test <- table(Predicted = test_preds, Actual = test_data$Class)

# Misclassification rate
mr_test <- 1 - sum(diag(cm_test)) / sum(cm_test)

# Output results
cm_test

##          Actual
## Predicted FALSE TRUE
##     FALSE    30   22
##     TRUE     60  188

mr_test

## [1] 0.2733333

Your observation: The confusion matrix was generated for the test set using the optimal cutoff of 0.41. The misclassification rate (MR) is 0.273, meaning that approximately 27.3% of the observations were incorrectly classified.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

# ROC curve for test set
roc_test <- roc(test_data$Class, test_probs)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

# Plot ROC curve
plot(roc_test, main = "ROC Curve - Test Set")

# Calculate AUC
auc_test <- auc(roc_test)

# Output AUC
auc_test

## Area under the curve: 0.7562

Your observation: The ROC curve was generated for the test set to evaluate model performance. The AUC value is 0.7562, which is lower than the training AUC of 0.8497 but still indicates a reasonable ability to distinguish between good and bad credit risk on unseen data.

Task 5: Using different weights (20pts)

Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!

1. Obtain optimal probability cut-off point again, with the new weights.

# Define weights
weight_FP <- 5
weight_FN <- 1

# Function to calculate the weighted cost from a 2x2 confusion matrix
# (rows = predicted class, columns = actual class)
cost_function <- function(cm) {
  FP <- cm[2, 1]  # predicted TRUE, actual FALSE
  FN <- cm[1, 2]  # predicted FALSE, actual TRUE
  total <- sum(cm)

  cost <- (weight_FP * FP + weight_FN * FN) / total
  return(cost)
}

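Your observation: A higher weight of 5 was assigned to false positives (predicting TRUE, i.e. good, for a customer whose actual Class is FALSE) because misclassifying a bad customer as good is more costly. A lower weight of 1 was assigned to false negatives. The cost function was defined to reflect these differences in classification errors. The cutoff search was not rerun with this cost function; the results that follow still use the equal-weight cutoff of 0.41.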
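
A cutoff search under the 5:1 weights could follow the same grid-search pattern as in Task 3, with the weighted cost in place of MR. The sketch below is only an illustration under stated assumptions: it runs on simulated probabilities (not the fitted credit model), and `weighted_cost`, `probs`, and `actual` are hypothetical names introduced here.

```r
# Simulated data, for illustration only (not the GermanCredit model output)
set.seed(1)
n      <- 500
actual <- rbinom(n, 1, 0.7) == 1                          # TRUE = good customer
probs  <- plogis(rnorm(n, mean = ifelse(actual, 1, -1)))  # noisy predicted P(good)

# Weighted cost per observation at a given cutoff
weighted_cost <- function(probs, actual, cutoff, w_fp = 5, w_fn = 1) {
  preds <- probs > cutoff
  fp <- sum(preds & !actual)   # predicted good, actually bad
  fn <- sum(!preds & actual)   # predicted bad, actually good
  (w_fp * fp + w_fn * fn) / length(actual)
}

# Grid search over candidate cutoffs
cutoffs <- seq(0.01, 0.99, by = 0.01)
costs   <- vapply(cutoffs, function(k) weighted_cost(probs, actual, k), numeric(1))
best    <- cutoffs[which.min(costs)]
best
```

Because false positives are penalized more heavily, a search like this generally favors a higher cutoff than the equal-weight one, so fewer customers are predicted good.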

2. Obtain the confusion matrix and MR for the training set.

# Use same cutoff (0.41)
train_preds <- ifelse(train_probs > 0.41, TRUE, FALSE)

# Confusion matrix
cm_train <- table(Predicted = train_preds, Actual = train_data$Class)

# Misclassification rate (MR)
mr_train <- 1 - sum(diag(cm_train)) / sum(cm_train)

# Output
cm_train

##          Actual
## Predicted FALSE TRUE
##     FALSE   103   27
##     TRUE    107  463

mr_train

## [1] 0.1914286

Your observation: The confusion matrix was generated for the training set using the cutoff of 0.41. The misclassification rate (MR) is 0.191, meaning that approximately 19.1% of the observations were incorrectly classified.

3. Obtain the confusion matrix and MR for the test set.

# Use same cutoff (0.41)
test_preds <- ifelse(test_probs > 0.41, TRUE, FALSE)

# Confusion matrix
cm_test <- table(Predicted = test_preds, Actual = test_data$Class)

# Misclassification rate (MR)
mr_test <- 1 - sum(diag(cm_test)) / sum(cm_test)

# Output
cm_test

##          Actual
## Predicted FALSE TRUE
##     FALSE    30   22
##     TRUE     60  188

mr_test

## [1] 0.2733333

Your observation: The confusion matrix was generated for the test set using the cutoff of 0.41. The misclassification rate (MR) is 0.273, meaning that approximately 27.3% of the observations were incorrectly classified.
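
MR counts both error types equally, so under the 5:1 weights the weighted cost is the more relevant summary. As a worked example, the sketch below applies the Task 5 cost formula to the two confusion matrices printed above (cutoff 0.41). The helper names `make_cm` and `cost_from_cm` are hypothetical and introduced only for this illustration.

```r
# Build a 2x2 confusion matrix (rows = predicted, columns = actual)
make_cm <- function(tn, fp, fn, tp) {
  matrix(c(tn, fp, fn, tp), nrow = 2,
         dimnames = list(Predicted = c("FALSE", "TRUE"),
                         Actual    = c("FALSE", "TRUE")))
}

# Weighted cost per observation: false positives cost 5, false negatives cost 1
cost_from_cm <- function(cm, w_fp = 5, w_fn = 1) {
  fp <- cm["TRUE", "FALSE"]   # predicted good, actually bad
  fn <- cm["FALSE", "TRUE"]   # predicted bad, actually good
  (w_fp * fp + w_fn * fn) / sum(cm)
}

cm_train <- make_cm(tn = 103, fp = 107, fn = 27, tp = 463)
cm_test  <- make_cm(tn = 30,  fp = 60,  fn = 22, tp = 188)

cost_train <- cost_from_cm(cm_train)   # (5 * 107 + 27) / 700
cost_test  <- cost_from_cm(cm_test)    # (5 * 60 + 22) / 300
round(c(train = cost_train, test = cost_test), 3)
```

This gives a weighted cost of about 0.803 per observation on the training set and about 1.073 on the test set, with false positives accounting for most of it in both cases.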

Task 6: Conclusion (10pts)

Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model further.

The optimal cutoff under equal weights was 0.41, which gave the lowest misclassification rate on the training data. The training MR was 0.191, while the test MR was 0.273, showing that the model performs somewhat worse on new data. The AUC was 0.8497 on the training set and 0.7562 on the test set, which means the model does a decent job of distinguishing between good and bad credit risk.

Overall, the model works pretty well, but the higher error on the test set suggests it may be slightly overfitting. To improve it, I could try reducing the number of variables, using regularization, or testing other models like decision trees or random forests. I could also adjust the cutoff based on cost instead of just minimizing error.

Homework4

Serina Zavala

04/06/2026

