Task 1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) # caret ships the GermanCredit data with categorical variables already coded as numeric 0/1 columns

## Warning: package 'caret' was built under R version 4.5.2

## Loading required package: ggplot2

## Loading required package: lattice

data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

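Your observation: The GermanCredit dataset contains 1,000 observations and 62 variables before any variables are dropped. The response variable is `Class`, which records Good or Bad credit and is converted by the code above to a logical (TRUE = Good, FALSE = Bad). Of the remaining 61 variables, seven are integer-valued measures and 54 are numeric columns coded 0/1.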

# Optional: drop variables that provide no additional information in the data
GermanCredit <- GermanCredit[, -c(14,19,27,30,35,40,44,45,48,52,55,58,62)] # don't run this code twice! Think about why.
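The warning about running the drop twice can be illustrated on a small example. The sketch below uses a hypothetical toy data frame (`toy` and `drop_cols` are illustrative names, not part of the assignment) and base R only. It shows that a positional drop removes different columns on each run, while a name-based drop is safe to repeat.

```r
# Toy data frame standing in for GermanCredit (hypothetical columns).
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Positional drop: each run removes whatever currently sits in position 2.
pos_once  <- toy[, -2]
pos_twice <- pos_once[, -2]
names(pos_once)   # "a" "c" "d"
names(pos_twice)  # "a" "d"  (column "c" is lost by accident)

# Name-based drop: a second run finds nothing left to remove.
drop_cols <- c("b")
by_name_once  <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
by_name_twice <- by_name_once[, setdiff(names(by_name_once), drop_cols), drop = FALSE]
identical(names(by_name_once), names(by_name_twice))  # TRUE
```

The same pattern would apply to GermanCredit if the thirteen columns were listed by name instead of by position.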

2. Explore the dataset to understand its structure. (10pts)

# explore variable types
sapply(GermanCredit, class)

##                           Duration                             Amount 
##                          "integer"                          "integer" 
##          InstallmentRatePercentage                  ResidenceDuration 
##                          "integer"                          "integer" 
##                                Age              NumberExistingCredits 
##                          "integer"                          "integer" 
##            NumberPeopleMaintenance                          Telephone 
##                          "integer"                          "numeric" 
##                      ForeignWorker                              Class 
##                          "numeric"                          "logical" 
##         CheckingAccountStatus.lt.0     CheckingAccountStatus.0.to.200 
##                          "numeric"                          "numeric" 
##       CheckingAccountStatus.gt.200     CreditHistory.NoCredit.AllPaid 
##                          "numeric"                          "numeric" 
##     CreditHistory.ThisBank.AllPaid             CreditHistory.PaidDuly 
##                          "numeric"                          "numeric" 
##                CreditHistory.Delay                     Purpose.NewCar 
##                          "numeric"                          "numeric" 
##                    Purpose.UsedCar        Purpose.Furniture.Equipment 
##                          "numeric"                          "numeric" 
##           Purpose.Radio.Television          Purpose.DomesticAppliance 
##                          "numeric"                          "numeric" 
##                    Purpose.Repairs                  Purpose.Education 
##                          "numeric"                          "numeric" 
##                 Purpose.Retraining                   Purpose.Business 
##                          "numeric"                          "numeric" 
##         SavingsAccountBonds.lt.100     SavingsAccountBonds.100.to.500 
##                          "numeric"                          "numeric" 
##    SavingsAccountBonds.500.to.1000        SavingsAccountBonds.gt.1000 
##                          "numeric"                          "numeric" 
##            EmploymentDuration.lt.1          EmploymentDuration.1.to.4 
##                          "numeric"                          "numeric" 
##          EmploymentDuration.4.to.7            EmploymentDuration.gt.7 
##                          "numeric"                          "numeric" 
##   Personal.Male.Divorced.Seperated          Personal.Female.NotSingle 
##                          "numeric"                          "numeric" 
##               Personal.Male.Single        OtherDebtorsGuarantors.None 
##                          "numeric"                          "numeric" 
## OtherDebtorsGuarantors.CoApplicant                Property.RealEstate 
##                          "numeric"                          "numeric" 
##                 Property.Insurance                  Property.CarOther 
##                          "numeric"                          "numeric" 
##         OtherInstallmentPlans.Bank       OtherInstallmentPlans.Stores 
##                          "numeric"                          "numeric" 
##                       Housing.Rent                        Housing.Own 
##                          "numeric"                          "numeric" 
##            Job.UnemployedUnskilled              Job.UnskilledResident 
##                          "numeric"                          "numeric" 
##                Job.SkilledEmployee 
##                          "numeric"

# count for missing values
colSums(is.na(GermanCredit))

##                           Duration                             Amount 
##                                  0                                  0 
##          InstallmentRatePercentage                  ResidenceDuration 
##                                  0                                  0 
##                                Age              NumberExistingCredits 
##                                  0                                  0 
##            NumberPeopleMaintenance                          Telephone 
##                                  0                                  0 
##                      ForeignWorker                              Class 
##                                  0                                  0 
##         CheckingAccountStatus.lt.0     CheckingAccountStatus.0.to.200 
##                                  0                                  0 
##       CheckingAccountStatus.gt.200     CreditHistory.NoCredit.AllPaid 
##                                  0                                  0 
##     CreditHistory.ThisBank.AllPaid             CreditHistory.PaidDuly 
##                                  0                                  0 
##                CreditHistory.Delay                     Purpose.NewCar 
##                                  0                                  0 
##                    Purpose.UsedCar        Purpose.Furniture.Equipment 
##                                  0                                  0 
##           Purpose.Radio.Television          Purpose.DomesticAppliance 
##                                  0                                  0 
##                    Purpose.Repairs                  Purpose.Education 
##                                  0                                  0 
##                 Purpose.Retraining                   Purpose.Business 
##                                  0                                  0 
##         SavingsAccountBonds.lt.100     SavingsAccountBonds.100.to.500 
##                                  0                                  0 
##    SavingsAccountBonds.500.to.1000        SavingsAccountBonds.gt.1000 
##                                  0                                  0 
##            EmploymentDuration.lt.1          EmploymentDuration.1.to.4 
##                                  0                                  0 
##          EmploymentDuration.4.to.7            EmploymentDuration.gt.7 
##                                  0                                  0 
##   Personal.Male.Divorced.Seperated          Personal.Female.NotSingle 
##                                  0                                  0 
##               Personal.Male.Single        OtherDebtorsGuarantors.None 
##                                  0                                  0 
## OtherDebtorsGuarantors.CoApplicant                Property.RealEstate 
##                                  0                                  0 
##                 Property.Insurance                  Property.CarOther 
##                                  0                                  0 
##         OtherInstallmentPlans.Bank       OtherInstallmentPlans.Stores 
##                                  0                                  0 
##                       Housing.Rent                        Housing.Own 
##                                  0                                  0 
##            Job.UnemployedUnskilled              Job.UnskilledResident 
##                                  0                                  0 
##                Job.SkilledEmployee 
##                                  0

# summary of dataset
summary(GermanCredit)

##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker     Class         CheckingAccountStatus.lt.0
##  Min.   :0.000   Mode :logical   Min.   :0.000             
##  1st Qu.:1.000   FALSE:300       1st Qu.:0.000             
##  Median :1.000   TRUE :700       Median :0.000             
##  Mean   :0.963                   Mean   :0.274             
##  3rd Qu.:1.000                   3rd Qu.:1.000             
##  Max.   :1.000                   Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
##  Min.   :0.00                   Min.   :0.000                 
##  1st Qu.:0.00                   1st Qu.:0.000                 
##  Median :0.00                   Median :0.000                 
##  Mean   :0.04                   Mean   :0.049                 
##  3rd Qu.:0.00                   3rd Qu.:0.000                 
##  Max.   :1.00                   Max.   :1.000                 
##  CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.00           Min.   :0.000       Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00           1st Qu.:0.000       1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.00           Median :0.000       Median :0.000   Median :0.000  
##  Mean   :0.53           Mean   :0.088       Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.00           3rd Qu.:0.000       3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.00           Max.   :1.000       Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
##  Min.   :0.000   Min.   :0.00      Min.   :0.000      Min.   :0.000   
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0.000      1st Qu.:0.000   
##  Median :0.000   Median :0.00      Median :0.000      Median :0.000   
##  Mean   :0.022   Mean   :0.05      Mean   :0.009      Mean   :0.097   
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0.000      3rd Qu.:0.000   
##  Max.   :1.000   Max.   :1.00      Max.   :1.000      Max.   :1.000   
##  SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
##  Min.   :0.000              Min.   :0.000                 
##  1st Qu.:0.000              1st Qu.:0.000                 
##  Median :1.000              Median :0.000                 
##  Mean   :0.603              Mean   :0.103                 
##  3rd Qu.:1.000              3rd Qu.:0.000                 
##  Max.   :1.000              Max.   :1.000                 
##  SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
##  Min.   :0.000                   Min.   :0.000              
##  1st Qu.:0.000                   1st Qu.:0.000              
##  Median :0.000                   Median :0.000              
##  Mean   :0.063                   Mean   :0.048              
##  3rd Qu.:0.000                   3rd Qu.:0.000              
##  Max.   :1.000                   Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
##  Min.   :0.000           Min.   :0.00                    
##  1st Qu.:0.000           1st Qu.:0.00                    
##  Median :0.000           Median :0.00                    
##  Mean   :0.253           Mean   :0.05                    
##  3rd Qu.:1.000           3rd Qu.:0.00                    
##  Max.   :1.000           Max.   :1.00                    
##  Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
##  Min.   :0.00              Min.   :0.000        Min.   :0.000              
##  1st Qu.:0.00              1st Qu.:0.000        1st Qu.:1.000              
##  Median :0.00              Median :1.000        Median :1.000              
##  Mean   :0.31              Mean   :0.548        Mean   :0.907              
##  3rd Qu.:1.00              3rd Qu.:1.000        3rd Qu.:1.000              
##  Max.   :1.00              Max.   :1.000        Max.   :1.000              
##  OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
##  Min.   :0.000                      Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                      1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                      Median :0.000       Median :0.000     
##  Mean   :0.041                      Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                      3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                      Max.   :1.000       Max.   :1.000     
##  Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
##  Min.   :0.000     Min.   :0.000              Min.   :0.000               
##  1st Qu.:0.000     1st Qu.:0.000              1st Qu.:0.000               
##  Median :0.000     Median :0.000              Median :0.000               
##  Mean   :0.332     Mean   :0.139              Mean   :0.047               
##  3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:0.000               
##  Max.   :1.000     Max.   :1.000              Max.   :1.000               
##   Housing.Rent    Housing.Own    Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :0.000   Median :1.000   Median :0.000           Median :0.0          
##  Mean   :0.179   Mean   :0.713   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee
##  Min.   :0.00       
##  1st Qu.:0.00       
##  Median :1.00       
##  Mean   :0.63       
##  3rd Qu.:1.00       
##  Max.   :1.00

Your observation: The dataset has no missing values. The summary shows that the classes are imbalanced: 700 observations (70%) are Good and 300 (30%) are Bad. The categorical variables are stored as 0/1 dummy columns, one per level, so the mean of each dummy column is the share of observations in that level.

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

set.seed(2024)
train_index <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)

train_data <- GermanCredit[train_index, ]
test_data  <- GermanCredit[-train_index, ]

# observe the size of each split and its class balance
nrow(train_data)

## [1] 700

nrow(test_data)

## [1] 300

prop.table(table(train_data$Class))

## 
## FALSE  TRUE 
##   0.3   0.7

prop.table(table(test_data$Class))

## 
## FALSE  TRUE 
##   0.3   0.7

Your observation: The stratified split yields 700 training and 300 test observations, and both sets preserve the class balance of the full dataset: 70% Good and 30% Bad.
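The identical class proportions in both sets come from stratified sampling: rows are drawn within each class separately rather than from the pooled data. A minimal base-R sketch of that idea on synthetic labels follows. It illustrates the principle only and is not caret's internal implementation; `stratified_index` is a hypothetical helper.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # Index into idx so a class with a single row is handled correctly.
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(2024)
y <- rep(c(TRUE, FALSE), times = c(700, 300))  # synthetic 70/30 labels
train_idx <- stratified_index(y, p = 0.7)

length(train_idx)                  # 700
prop.table(table(y[train_idx]))    # class shares in the training rows
prop.table(table(y[-train_idx]))   # class shares in the held-out rows
```

Because 70% of each class is sampled, the training and held-out rows both keep the 70/30 balance of the labels.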

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

training_logistic <- as.formula(paste("Class ~", paste(setdiff(names(train_data), "Class"), collapse = " + ")))
model_gc_logit <- glm(training_logistic, data = train_data, family = binomial(link = "logit"))

summary(model_gc_logit)

## 
## Call:
## glm(formula = training_logistic, family = binomial(link = "logit"), 
##     data = train_data)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         9.7755921  1.7594925   5.556 2.76e-08 ***
## Duration                           -0.0281752  0.0114559  -2.459 0.013915 *  
## Amount                             -0.0001968  0.0000580  -3.394 0.000690 ***
## InstallmentRatePercentage          -0.3458012  0.1122102  -3.082 0.002058 ** 
## ResidenceDuration                  -0.1477247  0.1099835  -1.343 0.179222    
## Age                                -0.0011930  0.0111092  -0.107 0.914479    
## NumberExistingCredits              -0.1741853  0.2247245  -0.775 0.438277    
## NumberPeopleMaintenance            -0.2953842  0.3033517  -0.974 0.330188    
## Telephone                          -0.8357009  0.2619015  -3.191 0.001418 ** 
## ForeignWorker                      -1.6606566  0.8122576  -2.044 0.040905 *  
## CheckingAccountStatus.lt.0         -2.0280291  0.2899845  -6.994 2.68e-12 ***
## CheckingAccountStatus.0.to.200     -1.4706478  0.2943908  -4.996 5.87e-07 ***
## CheckingAccountStatus.gt.200       -0.6052653  0.4876931  -1.241 0.214577    
## CreditHistory.NoCredit.AllPaid     -1.2639798  0.5155113  -2.452 0.014211 *  
## CreditHistory.ThisBank.AllPaid     -1.8780235  0.5706646  -3.291 0.000999 ***
## CreditHistory.PaidDuly             -0.8775997  0.3159046  -2.778 0.005469 ** 
## CreditHistory.Delay                -0.4012640  0.4307837  -0.931 0.351608    
## Purpose.NewCar                     -1.0626620  0.8142904  -1.305 0.191887    
## Purpose.UsedCar                     1.1942539  0.8839916   1.351 0.176702    
## Purpose.Furniture.Equipment        -0.1681192  0.8320966  -0.202 0.839883    
## Purpose.Radio.Television           -0.3031554  0.8286036  -0.366 0.714467    
## Purpose.DomesticAppliance          -0.7371787  1.2321421  -0.598 0.549646    
## Purpose.Repairs                    -0.8575710  0.9887784  -0.867 0.385776    
## Purpose.Education                  -0.6848705  0.9364025  -0.731 0.464544    
## Purpose.Retraining                 -0.1649183  1.5465838  -0.107 0.915079    
## Purpose.Business                   -0.3600823  0.8535288  -0.422 0.673116    
## SavingsAccountBonds.lt.100         -0.9786195  0.3127225  -3.129 0.001752 ** 
## SavingsAccountBonds.100.to.500     -0.9669534  0.4406228  -2.195 0.028198 *  
## SavingsAccountBonds.500.to.1000    -0.2529878  0.5442721  -0.465 0.642061    
## SavingsAccountBonds.gt.1000         0.2713176  0.6594268   0.411 0.680747    
## EmploymentDuration.lt.1            -0.4435735  0.5345880  -0.830 0.406681    
## EmploymentDuration.1.to.4          -0.4275141  0.5069023  -0.843 0.399013    
## EmploymentDuration.4.to.7           0.4416798  0.5618787   0.786 0.431822    
## EmploymentDuration.gt.7            -0.2520532  0.5037635  -0.500 0.616835    
## Personal.Male.Divorced.Seperated   -0.4301280  0.5538492  -0.777 0.437385    
## Personal.Female.NotSingle          -0.0179029  0.3950224  -0.045 0.963851    
## Personal.Male.Single                0.6299901  0.3971902   1.586 0.112713    
## OtherDebtorsGuarantors.None        -1.0309812  0.5142560  -2.005 0.044984 *  
## OtherDebtorsGuarantors.CoApplicant -1.0727811  0.7201303  -1.490 0.136302    
## Property.RealEstate                 1.2295999  0.5185315   2.371 0.017725 *  
## Property.Insurance                  0.8935212  0.5097800   1.753 0.079643 .  
## Property.CarOther                   1.1356001  0.5048681   2.249 0.024493 *  
## OtherInstallmentPlans.Bank         -0.6436463  0.3046547  -2.113 0.034626 *  
## OtherInstallmentPlans.Stores       -0.2405278  0.4731218  -0.508 0.611184    
## Housing.Rent                       -0.7041915  0.5817432  -1.210 0.226093    
## Housing.Own                        -0.5109041  0.5552490  -0.920 0.357502    
## Job.UnemployedUnskilled             0.4681174  0.8091298   0.579 0.562897    
## Job.UnskilledResident               0.3450109  0.4498926   0.767 0.443156    
## Job.SkilledEmployee                 0.1604813  0.3719210   0.431 0.666110    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 855.21  on 699  degrees of freedom
## Residual deviance: 595.77  on 651  degrees of freedom
## AIC: 693.77
## 
## Number of Fisher Scoring iterations: 5

Your observation: The summary output shows predictors with both positive and negative coefficients, indicating the direction in which each one influences the log-odds of being a Good credit customer.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

# View coefficients and odds ratios
summary(model_gc_logit)

## 
## Call:
## glm(formula = training_logistic, family = binomial(link = "logit"), 
##     data = train_data)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         9.7755921  1.7594925   5.556 2.76e-08 ***
## Duration                           -0.0281752  0.0114559  -2.459 0.013915 *  
## Amount                             -0.0001968  0.0000580  -3.394 0.000690 ***
## InstallmentRatePercentage          -0.3458012  0.1122102  -3.082 0.002058 ** 
## ResidenceDuration                  -0.1477247  0.1099835  -1.343 0.179222    
## Age                                -0.0011930  0.0111092  -0.107 0.914479    
## NumberExistingCredits              -0.1741853  0.2247245  -0.775 0.438277    
## NumberPeopleMaintenance            -0.2953842  0.3033517  -0.974 0.330188    
## Telephone                          -0.8357009  0.2619015  -3.191 0.001418 ** 
## ForeignWorker                      -1.6606566  0.8122576  -2.044 0.040905 *  
## CheckingAccountStatus.lt.0         -2.0280291  0.2899845  -6.994 2.68e-12 ***
## CheckingAccountStatus.0.to.200     -1.4706478  0.2943908  -4.996 5.87e-07 ***
## CheckingAccountStatus.gt.200       -0.6052653  0.4876931  -1.241 0.214577    
## CreditHistory.NoCredit.AllPaid     -1.2639798  0.5155113  -2.452 0.014211 *  
## CreditHistory.ThisBank.AllPaid     -1.8780235  0.5706646  -3.291 0.000999 ***
## CreditHistory.PaidDuly             -0.8775997  0.3159046  -2.778 0.005469 ** 
## CreditHistory.Delay                -0.4012640  0.4307837  -0.931 0.351608    
## Purpose.NewCar                     -1.0626620  0.8142904  -1.305 0.191887    
## Purpose.UsedCar                     1.1942539  0.8839916   1.351 0.176702    
## Purpose.Furniture.Equipment        -0.1681192  0.8320966  -0.202 0.839883    
## Purpose.Radio.Television           -0.3031554  0.8286036  -0.366 0.714467    
## Purpose.DomesticAppliance          -0.7371787  1.2321421  -0.598 0.549646    
## Purpose.Repairs                    -0.8575710  0.9887784  -0.867 0.385776    
## Purpose.Education                  -0.6848705  0.9364025  -0.731 0.464544    
## Purpose.Retraining                 -0.1649183  1.5465838  -0.107 0.915079    
## Purpose.Business                   -0.3600823  0.8535288  -0.422 0.673116    
## SavingsAccountBonds.lt.100         -0.9786195  0.3127225  -3.129 0.001752 ** 
## SavingsAccountBonds.100.to.500     -0.9669534  0.4406228  -2.195 0.028198 *  
## SavingsAccountBonds.500.to.1000    -0.2529878  0.5442721  -0.465 0.642061    
## SavingsAccountBonds.gt.1000         0.2713176  0.6594268   0.411 0.680747    
## EmploymentDuration.lt.1            -0.4435735  0.5345880  -0.830 0.406681    
## EmploymentDuration.1.to.4          -0.4275141  0.5069023  -0.843 0.399013    
## EmploymentDuration.4.to.7           0.4416798  0.5618787   0.786 0.431822    
## EmploymentDuration.gt.7            -0.2520532  0.5037635  -0.500 0.616835    
## Personal.Male.Divorced.Seperated   -0.4301280  0.5538492  -0.777 0.437385    
## Personal.Female.NotSingle          -0.0179029  0.3950224  -0.045 0.963851    
## Personal.Male.Single                0.6299901  0.3971902   1.586 0.112713    
## OtherDebtorsGuarantors.None        -1.0309812  0.5142560  -2.005 0.044984 *  
## OtherDebtorsGuarantors.CoApplicant -1.0727811  0.7201303  -1.490 0.136302    
## Property.RealEstate                 1.2295999  0.5185315   2.371 0.017725 *  
## Property.Insurance                  0.8935212  0.5097800   1.753 0.079643 .  
## Property.CarOther                   1.1356001  0.5048681   2.249 0.024493 *  
## OtherInstallmentPlans.Bank         -0.6436463  0.3046547  -2.113 0.034626 *  
## OtherInstallmentPlans.Stores       -0.2405278  0.4731218  -0.508 0.611184    
## Housing.Rent                       -0.7041915  0.5817432  -1.210 0.226093    
## Housing.Own                        -0.5109041  0.5552490  -0.920 0.357502    
## Job.UnemployedUnskilled             0.4681174  0.8091298   0.579 0.562897    
## Job.UnskilledResident               0.3450109  0.4498926   0.767 0.443156    
## Job.SkilledEmployee                 0.1604813  0.3719210   0.431 0.666110    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 855.21  on 699  degrees of freedom
## Residual deviance: 595.77  on 651  degrees of freedom
## AIC: 693.77
## 
## Number of Fisher Scoring iterations: 5

exp(coef(model_gc_logit))

##                        (Intercept)                           Duration 
##                       1.759891e+04                       9.722180e-01 
##                             Amount          InstallmentRatePercentage 
##                       9.998032e-01                       7.076531e-01 
##                  ResidenceDuration                                Age 
##                       8.626686e-01                       9.988077e-01 
##              NumberExistingCredits            NumberPeopleMaintenance 
##                       8.401412e-01                       7.442456e-01 
##                          Telephone                      ForeignWorker 
##                       4.335705e-01                       1.900142e-01 
##         CheckingAccountStatus.lt.0     CheckingAccountStatus.0.to.200 
##                       1.315946e-01                       2.297766e-01 
##       CheckingAccountStatus.gt.200     CreditHistory.NoCredit.AllPaid 
##                       5.459296e-01                       2.825274e-01 
##     CreditHistory.ThisBank.AllPaid             CreditHistory.PaidDuly 
##                       1.528920e-01                       4.157797e-01 
##                CreditHistory.Delay                     Purpose.NewCar 
##                       6.694733e-01                       3.455348e-01 
##                    Purpose.UsedCar        Purpose.Furniture.Equipment 
##                       3.301094e+00                       8.452531e-01 
##           Purpose.Radio.Television          Purpose.DomesticAppliance 
##                       7.384843e-01                       4.784619e-01 
##                    Purpose.Repairs                  Purpose.Education 
##                       4.241912e-01                       5.041555e-01 
##                 Purpose.Retraining                   Purpose.Business 
##                       8.479630e-01                       6.976189e-01 
##         SavingsAccountBonds.lt.100     SavingsAccountBonds.100.to.500 
##                       3.758296e-01                       3.802397e-01 
##    SavingsAccountBonds.500.to.1000        SavingsAccountBonds.gt.1000 
##                       7.764774e-01                       1.311692e+00 
##            EmploymentDuration.lt.1          EmploymentDuration.1.to.4 
##                       6.417390e-01                       6.521282e-01 
##          EmploymentDuration.4.to.7            EmploymentDuration.gt.7 
##                       1.555318e+00                       7.772034e-01 
##   Personal.Male.Divorced.Seperated          Personal.Female.NotSingle 
##                       6.504258e-01                       9.822565e-01 
##               Personal.Male.Single        OtherDebtorsGuarantors.None 
##                       1.877592e+00                       3.566568e-01 
## OtherDebtorsGuarantors.CoApplicant                Property.RealEstate 
##                       3.420559e-01                       3.419861e+00 
##                 Property.Insurance                  Property.CarOther 
##                       2.443719e+00                       3.113041e+00 
##         OtherInstallmentPlans.Bank       OtherInstallmentPlans.Stores 
##                       5.253732e-01                       7.862128e-01 
##                       Housing.Rent                        Housing.Own 
##                       4.945082e-01                       5.999529e-01 
##            Job.UnemployedUnskilled              Job.UnskilledResident 
##                       1.596985e+00                       1.412005e+00 
##                Job.SkilledEmployee 
##                       1.174076e+00
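One way to read the exponentiated coefficients above is as odds ratios: a value below 1 lowers the odds of Class being TRUE (Good), and a value above 1 raises them. The sketch below uses the Duration value printed above (about 0.972). The variable names are illustrative and not part of the original analysis, and it assumes Duration is measured in months, as in the German credit data.

```r
# Odds ratio for Duration, copied from the exp(coef(...)) output above
or_duration <- 0.9722180

# Percent change in the odds of Good for one extra month of duration
pct_change_per_month <- (or_duration - 1) * 100

# Recover the coefficient on the log-odds scale; a negative value means
# longer durations lower the odds of Good
coef_duration <- log(or_duration)

# Odds ratio for a 12-month increase (odds ratios multiply, coefficients add)
or_12_months <- or_duration^12

pct_change_per_month
coef_duration
or_12_months
```

The same reading applies to the dummy variables, except that each odds ratio is then relative to the omitted reference level of that factor.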

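Your observation: The model coefficients indicate how each predictor affects the log-odds of a customer being classified as Good; exponentiating them gives odds ratios. Duration has an odds ratio of about 0.972, which is below 1, so its coefficient is negative: each additional month of credit duration multiplies the odds of being Good by roughly 0.972, holding the other predictors fixed. Customers with longer credit durations are therefore less likely to be classified as Good.

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)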

1. Use the training set to obtain predicted probabilities.

train_pred_probs_gc <- predict(model_gc_logit, newdata = train_data, type = "response")

# Check the first few predicted probabilities
head(train_pred_probs_gc)

##         1         2         3         4         5         7 
## 0.9484830 0.2779603 0.9822123 0.6263646 0.1067110 0.9223900

Your observation:

The predicted probabilities represent how likely each customer is to be classified as Good based on the fitted logistic model. Values close to 1 indicate high likelihood of Good credit, and values near 0 suggest Bad credit.
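Probabilities returned with type = "response" are the inverse logit of the model's linear predictor. A minimal sketch of that relationship follows; it uses the built-in mtcars data as a stand-in, so toy_model is a hypothetical model and not model_gc_logit.

```r
# Toy logistic model on a built-in dataset (stand-in for model_gc_logit)
toy_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Linear predictor (log-odds) and fitted probabilities
log_odds <- predict(toy_model, type = "link")
probs    <- predict(toy_model, type = "response")

# plogis() is the inverse logit, 1 / (1 + exp(-x)), so the two should agree
all.equal(plogis(log_odds), probs)
```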

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

# Define a sequence of possible cutoff points
cutoff_values_gc <- seq(0.1, 0.9, by = 0.01)

# Function to calculate Misclassification Rate (MR) for each cutoff
calc_mr_gc <- function(cutoff) {
  predicted_class <- ifelse(train_pred_probs_gc >= cutoff, TRUE, FALSE)
  mean(predicted_class != train_data$Class)
}

# Apply function to all cutoff values
mr_results_gc <- sapply(cutoff_values_gc, calc_mr_gc)

# Find cutoff that gives the smallest MR
optimal_cutoff_gc <- cutoff_values_gc[which.min(mr_results_gc)]
optimal_cutoff_gc

## [1] 0.41

Your observation: The optimal cut-off is 0.41, which is below the default of 0.5, so the model labels more customers as Good than it would at 0.5.
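Note that which.min() returns only the first cut-off that reaches the minimum MR, and on a grid several cut-offs often tie. The sketch below lists every tied cut-off on made-up data; toy_probs and toy_labels are hypothetical values, not the GermanCredit predictions.

```r
# Hypothetical predicted probabilities and true labels
toy_probs  <- c(0.10, 0.30, 0.35, 0.60, 0.80, 0.90)
toy_labels <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

# Misclassification rate at each cut-off on the same grid as above
cutoffs <- seq(0.1, 0.9, by = 0.01)
mr <- sapply(cutoffs, function(cutoff) mean((toy_probs >= cutoff) != toy_labels))

# which.min() keeps only the first minimiser
first_best <- cutoffs[which.min(mr)]

# All cut-offs that tie for the minimum (small tolerance for safety)
all_best <- cutoffs[mr <= min(mr) + 1e-12]

first_best
range(all_best)
```

When many cut-offs tie, reporting the range (or its midpoint) can be more informative than the single value returned by which.min().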

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the training set.

# Predict class using the optimal cutoff
train_pred_class_gc <- ifelse(train_pred_probs_gc >= optimal_cutoff_gc, TRUE, FALSE)

# Generate confusion matrix
conf_matrix_train_gc <- table(Predicted = train_pred_class_gc, Actual = train_data$Class)
conf_matrix_train_gc

##          Actual
## Predicted FALSE TRUE
##     FALSE   103   27
##     TRUE    107  463

# Calculate Misclassification Rate (MR)
mr_train_gc <- mean(train_pred_class_gc != train_data$Class)
mr_train_gc

## [1] 0.1914286
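The MR can also be read straight off the confusion matrix as the off-diagonal share. The sketch below re-enters the counts printed above by hand, so it runs without the fitted model, and also derives sensitivity and specificity, which the original output does not report. The object names are illustrative.

```r
# Counts copied from the training confusion matrix above (filled column-wise)
cm <- matrix(c(103, 107, 27, 463), nrow = 2,
             dimnames = list(Predicted = c("FALSE", "TRUE"),
                             Actual    = c("FALSE", "TRUE")))

# Misclassification rate: share of observations off the diagonal
mr <- 1 - sum(diag(cm)) / sum(cm)

# Sensitivity: share of actual Good customers predicted Good
sensitivity <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])

# Specificity: share of actual Bad customers predicted Bad
specificity <- cm["FALSE", "FALSE"] / sum(cm[, "FALSE"])

round(c(MR = mr, Sensitivity = sensitivity, Specificity = specificity), 4)
```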

Your observation:

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

# Load the pROC package for ROC and AUC calculations
library(pROC)

## Warning: package 'pROC' was built under R version 4.5.2

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# Compute ROC curve for the training set
roc_train_gc <- roc(train_data$Class, train_pred_probs_gc)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

# Plot the ROC curve
plot(roc_train_gc, main = "ROC Curve - Training Set (German Credit)", col = "blue", lwd = 2)

# Calculate AUC (Area Under the Curve)
auc_train_gc <- auc(roc_train_gc)
auc_train_gc

## Area under the curve: 0.8497
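The AUC can be interpreted as the probability that a randomly chosen Good customer receives a higher predicted probability than a randomly chosen Bad customer. A minimal sketch of that interpretation on made-up data follows; toy_probs and toy_labels are hypothetical, and the calculation is done by hand rather than with pROC.

```r
# Hypothetical predicted probabilities and true labels
toy_probs  <- c(0.10, 0.30, 0.35, 0.60, 0.80, 0.90)
toy_labels <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

n_pos <- sum(toy_labels)
n_neg <- sum(!toy_labels)

# Rank-sum (Mann-Whitney) form of the AUC
r <- rank(toy_probs)
auc_rank <- (sum(r[toy_labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Direct check: share of (Good, Bad) pairs ordered correctly, ties count half
pair_diffs <- outer(toy_probs[toy_labels], toy_probs[!toy_labels], "-")
auc_pairs <- mean((pair_diffs > 0) + 0.5 * (pair_diffs == 0))

c(auc_rank, auc_pairs)
```

Because the AUC depends only on how the probabilities rank the customers, it does not change with the cut-off chosen in Task 3.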

Your observation: The ROC curve for the training data shows good class separation, with an AUC of about 0.85. Because this is measured on the same data used to fit the model, it is likely an optimistic estimate of performance on new customers.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

# Predict probabilities on the test set
test_pred_probs_gc <- predict(model_gc_logit, newdata = test_data, type = "response")

# Predict class using the same optimal cutoff from the training set
test_pred_class_gc <- ifelse(test_pred_probs_gc >= optimal_cutoff_gc, TRUE, FALSE)

# Generate confusion matrix
conf_matrix_test_gc <- table(Predicted = test_pred_class_gc, Actual = test_data$Class)
conf_matrix_test_gc

##          Actual
## Predicted FALSE TRUE
##     FALSE    30   22
##     TRUE     60  188

# Calculate Misclassification Rate (MR)
mr_test_gc <- mean(test_pred_class_gc != test_data$Class)
mr_test_gc

## [1] 0.2733333

Your observation:

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

# Compute ROC curve for the test set
roc_test_gc <- roc(test_data$Class, test_pred_probs_gc)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

# Plot the ROC curve
plot(roc_test_gc, main = "ROC Curve - Test Set (German Credit)", col = "red", lwd = 2)

# Calculate AUC (Area Under the Curve)
auc_test_gc <- auc(roc_test_gc)
auc_test_gc

## Area under the curve: 0.7562

Your observation:

Task 5: Using different weights (20pts)

Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!

1. Obtain optimal probability cut-off point again, with the new weights.

# Define weights
weight_FP_gc <- 5  # Predict Good but actually Bad
weight_FN_gc <- 1  # Predict Bad but actually Good

# Function to calculate weighted cost for each cutoff
calc_cost_gc <- function(cutoff) {
  predicted_class <- train_pred_probs_gc >= cutoff

  # Count errors directly so the function still works if every prediction
  # falls in one class (indexing a table by "TRUE"/"FALSE" would then fail)
  FP <- sum(predicted_class & !train_data$Class)   # Predicted Good, actually Bad
  FN <- sum(!predicted_class & train_data$Class)   # Predicted Bad, actually Good

  # Weighted cost
  total_cost <- (weight_FP_gc * FP) + (weight_FN_gc * FN)
  return(total_cost / nrow(train_data))  # average cost
}

# Test a range of cutoff values
cutoff_values_weighted_gc <- seq(0.1, 0.9, by = 0.01)

# Calculate cost for each cutoff
cost_results_gc <- sapply(cutoff_values_weighted_gc, calc_cost_gc)

# Find cutoff with minimum cost
optimal_cutoff_weighted_gc <- cutoff_values_weighted_gc[which.min(cost_results_gc)]
optimal_cutoff_weighted_gc

## [1] 0.84
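As a sanity check on the grid search: if the predicted probabilities are well calibrated, predicting Good is cheaper than predicting Bad only when the expected cost of a false positive, w_FP * (1 - p), is below the expected cost of a false negative, w_FN * p. Rearranging gives the threshold p > w_FP / (w_FP + w_FN). This is a standard decision-theory result and not part of the original assignment; theoretical_cutoff is a hypothetical helper name.

```r
# Cost-minimising probability threshold under calibrated probabilities
theoretical_cutoff <- function(w_fp, w_fn) {
  w_fp / (w_fp + w_fn)
}

equal_weights <- theoretical_cutoff(w_fp = 1, w_fn = 1)   # 0.5
five_to_one   <- theoretical_cutoff(w_fp = 5, w_fn = 1)   # 5/6, about 0.833

c(equal_weights, five_to_one)
```

With weights of 5 and 1 the threshold is about 0.83, close to the grid-search value of 0.84 found above. The empirical optimum on a finite training set need not match the theoretical value exactly.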

Your observation:

2. Obtain the confusion matrix and MR for the training set.

# Predict class using the new weighted cutoff
train_pred_class_weighted_gc <- ifelse(train_pred_probs_gc >= optimal_cutoff_weighted_gc, TRUE, FALSE)

# Generate confusion matrix for the training set
conf_matrix_train_weighted_gc <- table(Predicted = train_pred_class_weighted_gc, Actual = train_data$Class)
conf_matrix_train_weighted_gc

##          Actual
## Predicted FALSE TRUE
##     FALSE   190  211
##     TRUE     20  279

# Calculate Misclassification Rate (MR)
mr_train_weighted_gc <- mean(train_pred_class_weighted_gc != train_data$Class)
mr_train_weighted_gc

## [1] 0.33

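Your observation: With the weighted cut-off of 0.84, the training MR rises from 0.191 to 0.33, but the average weighted cost falls from about 0.80 to about 0.44 because false positives drop from 107 to 20. The model now predicts Bad (401 customers) more often than Good (299).

3. Obtain the confusion matrix and MR for the test set.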

# Predict probabilities on the test set using the model
test_pred_probs_gc <- predict(model_gc_logit, newdata = test_data, type = "response")

# Predict class using the weighted cutoff
test_pred_class_weighted_gc <- ifelse(test_pred_probs_gc >= optimal_cutoff_weighted_gc, TRUE, FALSE)

# Generate confusion matrix for the test set
conf_matrix_test_weighted_gc <- table(Predicted = test_pred_class_weighted_gc, Actual = test_data$Class)
conf_matrix_test_weighted_gc

##          Actual
## Predicted FALSE TRUE
##     FALSE    69   92
##     TRUE     21  118

# Calculate Misclassification Rate (MR)
mr_test_weighted_gc <- mean(test_pred_class_weighted_gc != test_data$Class)
mr_test_weighted_gc

## [1] 0.3766667
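To see what the higher cut-off buys, the average weighted cost can be recomputed from the four confusion matrices printed in this report. The sketch below re-enters the false positive and false negative counts by hand; avg_cost is a hypothetical helper that mirrors the cost definition used in calc_cost_gc.

```r
# Average weighted cost from error counts (weights default to the 5:1 scheme)
avg_cost <- function(fp, fn, n, w_fp = 5, w_fn = 1) {
  (w_fp * fp + w_fn * fn) / n
}

# Counts copied from the confusion matrices above
costs <- c(
  train_cutoff_0.41 = avg_cost(fp = 107, fn = 27,  n = 700),
  train_cutoff_0.84 = avg_cost(fp = 20,  fn = 211, n = 700),
  test_cutoff_0.41  = avg_cost(fp = 60,  fn = 22,  n = 300),
  test_cutoff_0.84  = avg_cost(fp = 21,  fn = 92,  n = 300)
)

round(costs, 3)
```

On both sets the 0.84 cut-off has the higher MR but the lower weighted cost, which is the trade-off the 5:1 weighting is meant to produce.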

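Your observation: On the test data the pattern is the same as on the training data: the 0.84 cut-off keeps false positives low (21 of the 90 Bad customers are predicted Good) but rejects 92 of the 210 Good customers, giving an MR of 0.377 compared with 0.33 on the training set.

Task 6: Conclusion (10pts)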

Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model further.

In conclusion, the logistic regression model built using the GermanCredit dataset predicts customer creditworthiness reasonably well. The model achieved an AUC of approximately 0.85 on the training set and 0.76 on the test set. With equal weights the optimal cut-off is 0.41, which is below 0.5, and gives an MR of 0.191 on the training set and 0.273 on the test set. The drop in AUC and the rise in MR from training to test suggest some overfitting. With the 5:1 weights the optimal cut-off rises to 0.84, and the MR is 0.33 on the training set and 0.377 on the test set.

Homework_4

Charla Gabriel

2025-11-03

