Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for variable descriptions. The response variable is Class
and all others are predictors.
The following code installs the package caret only if it is not
already installed, so the installation runs once. The German credit
scoring data is provided in that package.
if (!require("caret", quietly = TRUE)) {
  install.packages("caret")
}
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: The GermanCredit dataset contains 1,000 observations and 62 variables. There are no missing values.
# Optional: drop the variables that provide no additional information in the data
# GermanCredit <- GermanCredit[, -c(14,19,27,30,35,40,44,45,48,52,55,58,62)] # don't run this code twice! Think about why.
summary(GermanCredit)
## Duration Amount InstallmentRatePercentage ResidenceDuration
## Min. : 4.0 Min. : 250 Min. :1.000 Min. :1.000
## 1st Qu.:12.0 1st Qu.: 1366 1st Qu.:2.000 1st Qu.:2.000
## Median :18.0 Median : 2320 Median :3.000 Median :3.000
## Mean :20.9 Mean : 3271 Mean :2.973 Mean :2.845
## 3rd Qu.:24.0 3rd Qu.: 3972 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :72.0 Max. :18424 Max. :4.000 Max. :4.000
## Age NumberExistingCredits NumberPeopleMaintenance Telephone
## Min. :19.00 Min. :1.000 Min. :1.000 Min. :0.000
## 1st Qu.:27.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000
## Median :33.00 Median :1.000 Median :1.000 Median :1.000
## Mean :35.55 Mean :1.407 Mean :1.155 Mean :0.596
## 3rd Qu.:42.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :75.00 Max. :4.000 Max. :2.000 Max. :1.000
## ForeignWorker Class CheckingAccountStatus.lt.0
## Min. :0.000 Mode :logical Min. :0.000
## 1st Qu.:1.000 FALSE:300 1st Qu.:0.000
## Median :1.000 TRUE :700 Median :0.000
## Mean :0.963 Mean :0.274
## 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.000 Max. :1.000
## CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.269 Mean :0.063
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## CheckingAccountStatus.none CreditHistory.NoCredit.AllPaid
## Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:0.00
## Median :0.000 Median :0.00
## Mean :0.394 Mean :0.04
## 3rd Qu.:1.000 3rd Qu.:0.00
## Max. :1.000 Max. :1.00
## CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly CreditHistory.Delay
## Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
## Median :0.000 Median :1.00 Median :0.000
## Mean :0.049 Mean :0.53 Mean :0.088
## 3rd Qu.:0.000 3rd Qu.:1.00 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.000
## CreditHistory.Critical Purpose.NewCar Purpose.UsedCar
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.293 Mean :0.234 Mean :0.103
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
## Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0.000
## Mean :0.181 Mean :0.28 Mean :0.012
## 3rd Qu.:0.000 3rd Qu.:1.00 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.000
## Purpose.Repairs Purpose.Education Purpose.Vacation Purpose.Retraining
## Min. :0.000 Min. :0.00 Min. :0 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0 Median :0.000
## Mean :0.022 Mean :0.05 Mean :0 Mean :0.009
## 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:0 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :0 Max. :1.000
## Purpose.Business Purpose.Other SavingsAccountBonds.lt.100
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :1.000
## Mean :0.097 Mean :0.012 Mean :0.603
## 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:1.000
## Max. :1.000 Max. :1.000 Max. :1.000
## SavingsAccountBonds.100.to.500 SavingsAccountBonds.500.to.1000
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.103 Mean :0.063
## 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## SavingsAccountBonds.gt.1000 SavingsAccountBonds.Unknown
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.048 Mean :0.183
## 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.172 Mean :0.339 Mean :0.174
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## EmploymentDuration.gt.7 EmploymentDuration.Unemployed
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.253 Mean :0.062
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## Personal.Male.Divorced.Seperated Personal.Female.NotSingle
## Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :0.00
## Mean :0.05 Mean :0.31
## 3rd Qu.:0.00 3rd Qu.:1.00
## Max. :1.00 Max. :1.00
## Personal.Male.Single Personal.Male.Married.Widowed Personal.Female.Single
## Min. :0.000 Min. :0.000 Min. :0
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0
## Median :1.000 Median :0.000 Median :0
## Mean :0.548 Mean :0.092 Mean :0
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0
## Max. :1.000 Max. :1.000 Max. :0
## OtherDebtorsGuarantors.None OtherDebtorsGuarantors.CoApplicant
## Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.000
## Median :1.000 Median :0.000
## Mean :0.907 Mean :0.041
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## OtherDebtorsGuarantors.Guarantor Property.RealEstate Property.Insurance
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.052 Mean :0.282 Mean :0.232
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Property.CarOther Property.Unknown OtherInstallmentPlans.Bank
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.332 Mean :0.154 Mean :0.139
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## OtherInstallmentPlans.Stores OtherInstallmentPlans.None Housing.Rent
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000
## Median :0.000 Median :1.000 Median :0.000
## Mean :0.047 Mean :0.814 Mean :0.179
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Housing.Own Housing.ForFree Job.UnemployedUnskilled Job.UnskilledResident
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0
## Median :1.000 Median :0.000 Median :0.000 Median :0.0
## Mean :0.713 Mean :0.108 Mean :0.022 Mean :0.2
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.0
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.0
## Job.SkilledEmployee Job.Management.SelfEmp.HighlyQualified
## Min. :0.00 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000
## Median :1.00 Median :0.000
## Mean :0.63 Mean :0.148
## 3rd Qu.:1.00 3rd Qu.:0.000
## Max. :1.00 Max. :1.000
dim(GermanCredit)
## [1] 1000 62
colSums(is.na(GermanCredit))
## Duration Amount
## 0 0
## InstallmentRatePercentage ResidenceDuration
## 0 0
## Age NumberExistingCredits
## 0 0
## NumberPeopleMaintenance Telephone
## 0 0
## ForeignWorker Class
## 0 0
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 0 0
## CheckingAccountStatus.gt.200 CheckingAccountStatus.none
## 0 0
## CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
## 0 0
## CreditHistory.PaidDuly CreditHistory.Delay
## 0 0
## CreditHistory.Critical Purpose.NewCar
## 0 0
## Purpose.UsedCar Purpose.Furniture.Equipment
## 0 0
## Purpose.Radio.Television Purpose.DomesticAppliance
## 0 0
## Purpose.Repairs Purpose.Education
## 0 0
## Purpose.Vacation Purpose.Retraining
## 0 0
## Purpose.Business Purpose.Other
## 0 0
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 0 0
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 0 0
## SavingsAccountBonds.Unknown EmploymentDuration.lt.1
## 0 0
## EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## 0 0
## EmploymentDuration.gt.7 EmploymentDuration.Unemployed
## 0 0
## Personal.Male.Divorced.Seperated Personal.Female.NotSingle
## 0 0
## Personal.Male.Single Personal.Male.Married.Widowed
## 0 0
## Personal.Female.Single OtherDebtorsGuarantors.None
## 0 0
## OtherDebtorsGuarantors.CoApplicant OtherDebtorsGuarantors.Guarantor
## 0 0
## Property.RealEstate Property.Insurance
## 0 0
## Property.CarOther Property.Unknown
## 0 0
## OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 0 0
## OtherInstallmentPlans.None Housing.Rent
## 0 0
## Housing.Own Housing.ForFree
## 0 0
## Job.UnemployedUnskilled Job.UnskilledResident
## 0 0
## Job.SkilledEmployee Job.Management.SelfEmp.HighlyQualified
## 0 0
table(GermanCredit$Class)
##
## FALSE TRUE
## 300 700
Your observation: The response variable Class (converted
to logical) is imbalanced, with 700 “Good” (TRUE) and 300
“Bad” (FALSE) customers (a 70/30 split). Most predictors are numeric 0/1 dummy
variables created from the original categorical features. Two of them,
Purpose.Vacation and Personal.Female.Single, are zero for every observation.
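The optional, commented-out line earlier drops columns by position, which is why it must not be run twice: after the first run the remaining columns shift into new positions, so the same indices then point at different columns. The sketch below illustrates this on a toy data frame and contrasts it with a name-based drop that is safe to repeat. The data frame `toy` and the helper `drop_by_name` are hypothetical and not part of the assignment.

```r
# Toy data frame standing in for a wider data set (hypothetical).
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Positional drop: the index refers to whatever column sits in slot 2 right now.
by_pos_once  <- toy[, -2, drop = FALSE]          # removes b
by_pos_twice <- by_pos_once[, -2, drop = FALSE]  # now removes c as well

# Name-based drop: repeating it leaves the result unchanged.
drop_by_name <- function(df, cols) {
  df[, setdiff(names(df), cols), drop = FALSE]
}
by_name_once  <- drop_by_name(toy, "b")
by_name_twice <- drop_by_name(by_name_once, "b")

names(by_pos_twice)   # only "a" is left
names(by_name_twice)  # "a" and "c" are both kept
```

Dropping by name also documents which variables are being removed, which a vector of column numbers does not.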
Set the seed to 2024 for reproducibility. (10pts)
set.seed(2024)
index <- sample(1:nrow(GermanCredit), nrow(GermanCredit) * 0.8)
GermanCredit_train <- GermanCredit[index, ]
GermanCredit_test <- GermanCredit[-index, ]
dim(GermanCredit_train)
## [1] 800 62
dim(GermanCredit_test)
## [1] 200 62
table(GermanCredit_train$Class)
##
## FALSE TRUE
## 229 571
table(GermanCredit_test$Class)
##
## FALSE TRUE
## 71 129
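Your observation: Using set.seed(2024), the data was
split into a training set of 800 observations and a test set of 200
observations. The class distribution is only roughly preserved: good
customers make up 71.4% of the training set (571/800) but 64.5% of the
test set (129/200), so the test set holds a larger share of bad customers
than the full data (30%).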
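A simple random split does not guarantee the same class shares in both sets. If matching shares matter, one option is to sample within each class separately. The following is a minimal base-R sketch on a toy response with the same 70/30 imbalance. The vector `y` and the helper `stratified_index` are hypothetical, and this is an alternative to the split above, not the method used in this report.

```r
set.seed(2024)  # any fixed seed works; 2024 mirrors the value used above

# Toy response with a 70/30 imbalance (hypothetical data).
y <- rep(c(TRUE, FALSE), times = c(700, 300))

# Hypothetical helper: sample a fixed fraction of row indices within each class.
stratified_index <- function(y, frac) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(length(idx) * frac))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(y, 0.8)
prop.table(table(y[train_idx]))   # class shares in the training rows
prop.table(table(y[-train_idx]))  # class shares in the held-out rows
```

The caret package also provides createDataPartition() for stratified splits, which may be more convenient when caret is already loaded.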
glm_credit <- glm(Class ~ ., family = binomial, data = GermanCredit_train)
Your observation: The logistic regression model was successfully fitted using the training data and all available predictors. Since the response variable is binary, logistic regression is appropriate for this problem. Using all predictors allows the model to capture multiple borrower characteristics that may influence whether a customer is classified as good or bad.
summary(glm_credit)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = GermanCredit_train)
##
## Coefficients: (13 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.241e+00 1.719e+00 5.376 7.61e-08
## Duration -2.994e-02 1.072e-02 -2.794 0.005214
## Amount -1.771e-04 5.095e-05 -3.475 0.000510
## InstallmentRatePercentage -3.718e-01 1.036e-01 -3.589 0.000332
## ResidenceDuration 2.577e-02 1.010e-01 0.255 0.798510
## Age 1.183e-02 1.097e-02 1.078 0.280974
## NumberExistingCredits -1.225e-01 2.189e-01 -0.560 0.575690
## NumberPeopleMaintenance -1.731e-01 2.945e-01 -0.588 0.556678
## Telephone -4.236e-01 2.371e-01 -1.786 0.074081
## ForeignWorker -1.651e+00 7.421e-01 -2.224 0.026143
## CheckingAccountStatus.lt.0 -1.817e+00 2.710e-01 -6.703 2.04e-11
## CheckingAccountStatus.0.to.200 -1.432e+00 2.686e-01 -5.330 9.81e-08
## CheckingAccountStatus.gt.200 -5.912e-01 4.631e-01 -1.277 0.201696
## CheckingAccountStatus.none NA NA NA NA
## CreditHistory.NoCredit.AllPaid -8.724e-01 5.139e-01 -1.698 0.089584
## CreditHistory.ThisBank.AllPaid -1.676e+00 5.493e-01 -3.052 0.002277
## CreditHistory.PaidDuly -6.686e-01 2.939e-01 -2.275 0.022899
## CreditHistory.Delay -9.413e-01 3.780e-01 -2.491 0.012756
## CreditHistory.Critical NA NA NA NA
## Purpose.NewCar -1.733e+00 1.013e+00 -1.710 0.087282
## Purpose.UsedCar 6.716e-02 1.033e+00 0.065 0.948146
## Purpose.Furniture.Equipment -8.257e-01 1.015e+00 -0.814 0.415816
## Purpose.Radio.Television -8.386e-01 1.019e+00 -0.823 0.410457
## Purpose.DomesticAppliance -1.227e+00 1.328e+00 -0.923 0.355762
## Purpose.Repairs -1.321e+00 1.165e+00 -1.134 0.256825
## Purpose.Education -2.020e+00 1.088e+00 -1.857 0.063374
## Purpose.Vacation NA NA NA NA
## Purpose.Retraining 4.276e-01 1.640e+00 0.261 0.794237
## Purpose.Business -8.618e-01 1.032e+00 -0.835 0.403529
## Purpose.Other NA NA NA NA
## SavingsAccountBonds.lt.100 -1.266e+00 3.201e-01 -3.956 7.63e-05
## SavingsAccountBonds.100.to.500 -1.075e+00 4.171e-01 -2.577 0.009964
## SavingsAccountBonds.500.to.1000 -8.768e-01 5.216e-01 -1.681 0.092761
## SavingsAccountBonds.gt.1000 1.301e-02 6.161e-01 0.021 0.983157
## SavingsAccountBonds.Unknown NA NA NA NA
## EmploymentDuration.lt.1 3.581e-01 5.167e-01 0.693 0.488195
## EmploymentDuration.1.to.4 5.527e-01 5.000e-01 1.105 0.268967
## EmploymentDuration.4.to.7 9.863e-01 5.355e-01 1.842 0.065524
## EmploymentDuration.gt.7 5.253e-01 5.039e-01 1.042 0.297218
## EmploymentDuration.Unemployed NA NA NA NA
## Personal.Male.Divorced.Seperated -2.546e-01 5.214e-01 -0.488 0.625274
## Personal.Female.NotSingle -1.274e-01 3.573e-01 -0.357 0.721452
## Personal.Male.Single 4.118e-01 3.623e-01 1.137 0.255622
## Personal.Male.Married.Widowed NA NA NA NA
## Personal.Female.Single NA NA NA NA
## OtherDebtorsGuarantors.None -1.239e+00 5.370e-01 -2.308 0.021018
## OtherDebtorsGuarantors.CoApplicant -1.565e+00 6.828e-01 -2.292 0.021919
## OtherDebtorsGuarantors.Guarantor NA NA NA NA
## Property.RealEstate 7.166e-01 4.898e-01 1.463 0.143477
## Property.Insurance 3.544e-01 4.785e-01 0.741 0.458926
## Property.CarOther 6.110e-01 4.648e-01 1.314 0.188702
## Property.Unknown NA NA NA NA
## OtherInstallmentPlans.Bank -8.504e-01 2.730e-01 -3.115 0.001838
## OtherInstallmentPlans.Stores -4.293e-01 4.711e-01 -0.911 0.362139
## OtherInstallmentPlans.None NA NA NA NA
## Housing.Rent -9.538e-01 5.624e-01 -1.696 0.089924
## Housing.Own -2.723e-01 5.282e-01 -0.516 0.606157
## Housing.ForFree NA NA NA NA
## Job.UnemployedUnskilled 1.449e+00 8.788e-01 1.649 0.099175
## Job.UnskilledResident -2.641e-03 4.101e-01 -0.006 0.994861
## Job.SkilledEmployee -1.073e-02 3.349e-01 -0.032 0.974438
## Job.Management.SelfEmp.HighlyQualified NA NA NA NA
##
## (Intercept) ***
## Duration **
## Amount ***
## InstallmentRatePercentage ***
## ResidenceDuration
## Age
## NumberExistingCredits
## NumberPeopleMaintenance
## Telephone .
## ForeignWorker *
## CheckingAccountStatus.lt.0 ***
## CheckingAccountStatus.0.to.200 ***
## CheckingAccountStatus.gt.200
## CheckingAccountStatus.none
## CreditHistory.NoCredit.AllPaid .
## CreditHistory.ThisBank.AllPaid **
## CreditHistory.PaidDuly *
## CreditHistory.Delay *
## CreditHistory.Critical
## Purpose.NewCar .
## Purpose.UsedCar
## Purpose.Furniture.Equipment
## Purpose.Radio.Television
## Purpose.DomesticAppliance
## Purpose.Repairs
## Purpose.Education .
## Purpose.Vacation
## Purpose.Retraining
## Purpose.Business
## Purpose.Other
## SavingsAccountBonds.lt.100 ***
## SavingsAccountBonds.100.to.500 **
## SavingsAccountBonds.500.to.1000 .
## SavingsAccountBonds.gt.1000
## SavingsAccountBonds.Unknown
## EmploymentDuration.lt.1
## EmploymentDuration.1.to.4
## EmploymentDuration.4.to.7 .
## EmploymentDuration.gt.7
## EmploymentDuration.Unemployed
## Personal.Male.Divorced.Seperated
## Personal.Female.NotSingle
## Personal.Male.Single
## Personal.Male.Married.Widowed
## Personal.Female.Single
## OtherDebtorsGuarantors.None *
## OtherDebtorsGuarantors.CoApplicant *
## OtherDebtorsGuarantors.Guarantor
## Property.RealEstate
## Property.Insurance
## Property.CarOther
## Property.Unknown
## OtherInstallmentPlans.Bank **
## OtherInstallmentPlans.Stores
## OtherInstallmentPlans.None
## Housing.Rent .
## Housing.Own
## Housing.ForFree
## Job.UnemployedUnskilled .
## Job.UnskilledResident
## Job.SkilledEmployee
## Job.Management.SelfEmp.HighlyQualified
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 958.02 on 799 degrees of freedom
## Residual deviance: 672.78 on 751 degrees of freedom
## AIC: 770.78
##
## Number of Fisher Scoring iterations: 5
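Your observation: The model output shows that several predictors are statistically significant, meaning they help explain the probability that a customer is classified as good. At the 0.001 level these are CheckingAccountStatus.lt.0, CheckingAccountStatus.0.to.200, SavingsAccountBonds.lt.100, Amount, and InstallmentRatePercentage; Duration, CreditHistory.ThisBank.AllPaid, SavingsAccountBonds.100.to.500, and OtherInstallmentPlans.Bank are significant at the 0.01 level. Thirteen coefficients are NA because of singularities: the dummy variables in each categorical group sum to one, so the last level in each group is redundant given the intercept and acts as the reference category, while Purpose.Vacation and Personal.Female.Single are zero for every observation.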
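The NA rows in the coefficient table come from perfect collinearity among the dummy variables. A minimal sketch on toy data (hypothetical, not the German credit data) shows the same behavior: when a three-level category is expanded into three 0/1 columns and an intercept is included, glm() leaves one of them undefined.

```r
# Toy data: a three-level category expanded into three 0/1 dummies (hypothetical).
g <- rep_len(c("a", "b", "c"), 63)
y <- rep_len(c(1, 1, 0, 1, 0, 1, 1), 63)
toy <- data.frame(
  y = y,
  a = as.numeric(g == "a"),
  b = as.numeric(g == "b"),
  c = as.numeric(g == "c")
)

# a + b + c equals 1 in every row, which duplicates the intercept column,
# so one dummy is aliased and glm() reports NA for it.
fit_all <- glm(y ~ ., family = binomial, data = toy)
coef(fit_all)

# Leaving one level out as the reference category removes the redundancy.
fit_ref <- glm(y ~ a + b, family = binomial, data = toy)
coef(fit_ref)
```

Both fits describe the same model, so their deviances agree; the NA only signals that the omitted level is absorbed into the intercept.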
pred_prob_train <- predict(glm_credit, type = "response")
Your observation:
Many predicted probabilities are relatively high, which makes sense
because the dataset contains more good customers than bad customers.
This suggests the model is capturing the class imbalance and assigning
higher probabilities to the majority class.
costfunc <- function(obs, pred.p, pcut) {
  weight_FN <- 1
  weight_FP <- 1
  pred_class <- (pred.p >= pcut)
  FN <- (obs == TRUE) & (pred_class == FALSE)
  FP <- (obs == FALSE) & (pred_class == TRUE)
  cost <- mean(weight_FN * FN + weight_FP * FP)
  return(cost)
}
pcut.seq <- seq(0.01, 0.99, by = 0.01)
MR_vec <- rep(0, length(pcut.seq))
for (i in seq_along(pcut.seq)) {
  MR_vec[i] <- costfunc(obs = GermanCredit_train$Class, pred.p = pred_prob_train, pcut = pcut.seq[i])
}
optimal.pcut <- pcut.seq[which.min(MR_vec)]
print(paste("Optimal cut-off (equal weight):", round(optimal.pcut, 3)))
## [1] "Optimal cut-off (equal weight): 0.39"
Your observation: Using equal weights for false negatives and false positives, the optimal probability cut-off is 0.39. This cut-off is lower than the default value of 0.50, which means the training misclassification rate is lowest when the model is somewhat more willing to classify customers as good. This result likely reflects the class distribution in the data and the trade-off needed to minimize the overall misclassification rate. Because the cut-off is chosen on the training data, it is tuned to that sample.
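The grid search above can also be written without an explicit loop. The sketch below does so on toy labels and probabilities (hypothetical values) and also shows that which.min() returns the first minimum, so ties between cut-offs resolve to the lowest one. The function `cost_at` is a hypothetical variant of costfunc with the weights exposed as arguments.

```r
# Misclassification cost with explicit weights (defaults give equal weighting).
cost_at <- function(obs, pred.p, pcut, weight_FN = 1, weight_FP = 1) {
  pred_class <- pred.p >= pcut
  FN <- obs & !pred_class   # good customer predicted as bad
  FP <- !obs & pred_class   # bad customer predicted as good
  mean(weight_FN * FN + weight_FP * FP)
}

# Toy labels and predicted probabilities (hypothetical).
obs    <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
pred.p <- c(0.15, 0.45, 0.75, 0.35, 0.55, 0.65, 0.85, 0.95)

# Evaluate the cost at every cut-off on a coarse grid.
pcut.grid <- seq(0.1, 0.9, by = 0.1)
cost_vec <- vapply(pcut.grid, function(pc) cost_at(obs, pred.p, pc), numeric(1))

best <- pcut.grid[which.min(cost_vec)]        # first cut-off with minimal cost
tied <- pcut.grid[cost_vec == min(cost_vec)]  # every cut-off with the same cost
```

On this toy data three cut-offs share the minimal cost, so it is worth inspecting `tied` (or plotting the cost against the cut-off) before reading much into a single optimal value.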
pred_class_train <- (pred_prob_train >= optimal.pcut) * 1
conf_train <- table(GermanCredit_train$Class, pred_class_train,
                    dnn = c("True", "Predicted"))
print(conf_train)
## Predicted
## True 0 1
## FALSE 106 123
## TRUE 34 537
MR_train <- mean(GermanCredit_train$Class != (pred_prob_train >= optimal.pcut))
print(paste("Training MR:", round(MR_train, 4)))
## [1] "Training MR: 0.1962"
Your observation: Using the optimal cut-off of 0.39, the model correctly classified 106 bad customers and 537 good customers in the training set, while misclassifying 123 bad customers as good and 34 good customers as bad. The training misclassification rate is (123 + 34) / 800 = 0.1962, so the model errs on about 19.6% of the training observations. That is better than the 28.6% error (229/800) from predicting every customer as good, but the errors are concentrated among bad customers: more than half of them (123 of 229) are predicted as good.
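The misclassification rate hides how the errors split between the two classes. As a worked check, the rates can be recomputed directly from the counts in the training confusion matrix above (the object names here are illustrative):

```r
# Counts copied from the training confusion matrix at cut-off 0.39.
conf <- matrix(c(106, 34, 123, 537), nrow = 2,
               dimnames = list(True = c("FALSE", "TRUE"),
                               Predicted = c("0", "1")))

TN <- conf["FALSE", "0"]  # bad customers predicted bad
FP <- conf["FALSE", "1"]  # bad customers predicted good
FN <- conf["TRUE", "0"]   # good customers predicted bad
TP <- conf["TRUE", "1"]   # good customers predicted good

MR  <- (FP + FN) / sum(conf)  # misclassification rate
TPR <- TP / (TP + FN)         # share of good customers predicted good
FPR <- FP / (FP + TN)         # share of bad customers predicted good
round(c(MR = MR, TPR = TPR, FPR = FPR), 4)
```

The true positive rate is about 0.94 while the false positive rate is about 0.54, which quantifies how unevenly the errors fall on bad customers.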
library(ROCR)
pred_train <- prediction(pred_prob_train, GermanCredit_train$Class)
ROC_train <- performance(pred_train, "tpr", "fpr")
plot(ROC_train, colorize = TRUE, main = "ROC Curve - Training")
auc_train <- performance(pred_train, "auc")
auc_train <- unlist(slot(auc_train, "y.values"))
print(paste("Training AUC:", round(auc_train, 4)))
## [1] "Training AUC: 0.8505"
Your observation: The ROC curve for the training set shows that the model has good discriminatory ability. The training AUC is 0.8505, which is well above 0.50 and indicates that the model does a strong job of separating good customers from bad customers. In other words, the model ranks observations fairly well even before applying a cut-off.
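The statement that AUC measures ranking quality can be made concrete: AUC equals the probability that a randomly chosen good customer receives a higher predicted probability than a randomly chosen bad customer. A minimal base-R sketch of that rank-based calculation follows; the helper `auc_rank` and the toy values are hypothetical, and this is not how ROCR computes the value internally.

```r
# Hypothetical helper: AUC as the share of (positive, negative) pairs in which
# the positive has the higher score; ties count one half via average ranks.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label)
  n_neg <- sum(!label)
  (sum(r[label]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores: two negatives (FALSE) and two positives (TRUE).
score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(FALSE, FALSE, TRUE, TRUE)
auc_rank(score, label)  # 0.75: three of the four pairs are ordered correctly
```

Because only ranks enter the formula, the AUC does not change with the choice of cut-off.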
pred_prob_test <- predict(glm_credit, newdata = GermanCredit_test, type = "response")
pred_class_test <- (pred_prob_test >= optimal.pcut) * 1
conf_test <- table(GermanCredit_test$Class, pred_class_test,
                   dnn = c("True", "Predicted"))
print(conf_test)
## Predicted
## True 0 1
## FALSE 28 43
## TRUE 12 117
MR_test <- mean(GermanCredit_test$Class != (pred_prob_test >= optimal.pcut))
print(paste("Test MR:", round(MR_test, 4)))
## [1] "Test MR: 0.275"
Your observation: Using the same cut-off of 0.39 on the test set, the model correctly classified 28 bad customers and 117 good customers, while misclassifying 43 bad customers as good and 12 good customers as bad. The test misclassification rate is (43 + 12) / 200 = 0.2750, meaning 27.5% of test observations were classified incorrectly. This is higher than the training MR of 0.1962, so the model performs worse on new data than on the training set, although it still beats the 35.5% error (71/200) from predicting every test customer as good.
pred_test <- prediction(pred_prob_test, GermanCredit_test$Class)
ROC_test <- performance(pred_test, "tpr", "fpr")
plot(ROC_test, colorize = TRUE, main = "ROC Curve - Test")
auc_test <- performance(pred_test, "auc")
auc_test <- unlist(slot(auc_test, "y.values"))
print(paste("Test AUC:", round(auc_test, 4)))
## [1] "Test AUC: 0.7353"
Your observation: The test AUC is 0.7353, which is lower than the training AUC of 0.8505. This means the model still has acceptable predictive ability on unseen data, but its performance drops when applied to the test set.
Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!
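costfunc_weighted <- function(obs, pred.p, pcut) {
  # FN = good customer (TRUE) predicted as bad; FP = bad customer (FALSE) predicted as good.
  # As coded, the weight of 5 falls on FN. The stated assumption (classing a bad customer
  # as good is the worse error) corresponds to weight_FP <- 5 and weight_FN <- 1 instead.
  # The outputs below were produced with the weights as written here.
  weight_FN <- 5
  weight_FP <- 1
  pred_class <- (pred.p >= pcut)
  FN <- (obs == TRUE) & (pred_class == FALSE)
  FP <- (obs == FALSE) & (pred_class == TRUE)
  cost <- mean(weight_FN * FN + weight_FP * FP)
  return(cost)
}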
MR_vec_w <- rep(0, length(pcut.seq))
for (i in seq_along(pcut.seq)) {
  MR_vec_w[i] <- costfunc_weighted(obs = GermanCredit_train$Class,
                                   pred.p = pred_prob_train,
                                   pcut = pcut.seq[i])
}
optimal.pcut_w <- pcut.seq[which.min(MR_vec_w)]
print(paste("Optimal cut-off (weighted 5:1):", round(optimal.pcut_w, 3)))
## [1] "Optimal cut-off (weighted 5:1): 0.22"
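Your observation: With the weights as coded, a false negative (a good customer predicted as bad) costs 5 and a false positive (a bad customer predicted as good) costs 1, and the optimal cut-off drops from 0.39 to 0.22. The lower threshold makes the model more likely to predict a customer as good, which avoids the error that this weighting penalizes. This is the reverse of the stated assumption, however: classing a bad customer as good is a false positive here, so the weight of 5 belongs on weight_FP. With that assignment the search would be expected to favor a cut-off above 0.39 rather than below it, since a stricter threshold approves fewer bad customers. Either way, the change in cut-off shows how business priorities directly affect classification decisions.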
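Since Class == TRUE denotes a good customer, "classing a customer as good when they are bad" is a false positive in the notation of the cost functions above. The sketch below is a hedged illustration of a cost function that puts the weight of 5 on that error, evaluated on toy data rather than the German credit data, so its cut-offs say nothing about the values this report would obtain. The function name `cost_bad_as_good` and all toy values are hypothetical. For well-calibrated probabilities, the expected cost is minimized by predicting "good" when p > weight_FP / (weight_FP + weight_FN) = 5/6, so a heavier false-positive weight pushes the cut-off up.

```r
# Hypothetical cost function: weight 5 on bad-predicted-as-good (FP),
# weight 1 on good-predicted-as-bad (FN).
cost_bad_as_good <- function(obs, pred.p, pcut, weight_FP = 5, weight_FN = 1) {
  pred_class <- pred.p >= pcut
  FP <- !obs & pred_class   # bad customer predicted as good
  FN <- obs & !pred_class   # good customer predicted as bad
  mean(weight_FP * FP + weight_FN * FN)
}

# Toy labels and predicted probabilities (hypothetical).
obs    <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
pred.p <- c(0.15, 0.45, 0.75, 0.35, 0.55, 0.65, 0.85, 0.95)
pcut.grid <- seq(0.1, 0.9, by = 0.1)

# Cost curve with equal weights and with the 5:1 weighting on FP.
cost_equal <- vapply(pcut.grid, function(pc) {
  cost_bad_as_good(obs, pred.p, pc, weight_FP = 1, weight_FN = 1)
}, numeric(1))
cost_fp5 <- vapply(pcut.grid, function(pc) {
  cost_bad_as_good(obs, pred.p, pc)
}, numeric(1))

cut_equal <- pcut.grid[which.min(cost_equal)]  # 0.2 on this toy data
cut_fp5   <- pcut.grid[which.min(cost_fp5)]    # 0.8 on this toy data
```

On the toy data the chosen cut-off moves from 0.2 to 0.8 once approving a bad customer becomes five times as costly, which is the direction the stated assumption implies.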
pred_class_train_w <- (pred_prob_train >= optimal.pcut_w) * 1
conf_train_w <- table(GermanCredit_train$Class, pred_class_train_w,
                      dnn = c("True", "Predicted"))
print(conf_train_w)
## Predicted
## True 0 1
## FALSE 41 188
## TRUE 4 567
weighted_MR_train <- costfunc_weighted(GermanCredit_train$Class, pred_prob_train, optimal.pcut_w)
print(paste("Weighted Training Cost:", round(weighted_MR_train, 4)))
## [1] "Weighted Training Cost: 0.26"
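Your observation: Using the weighted cut-off of 0.22 on the training set, the model correctly classified 41 bad customers and 567 good customers, while misclassifying 188 bad customers as good and only 4 good customers as bad. The weighted training cost is (5 × 4 + 1 × 188) / 800 = 0.26. The low cut-off nearly eliminates good customers predicted as bad, but 188 of the 229 bad customers (82%) are now predicted as good.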
pred_class_test_w <- (pred_prob_test >= optimal.pcut_w) * 1
conf_test_w <- table(GermanCredit_test$Class, pred_class_test_w,
                     dnn = c("True", "Predicted"))
print(conf_test_w)
## Predicted
## True 0 1
## FALSE 17 54
## TRUE 5 124
weighted_MR_test <- costfunc_weighted(GermanCredit_test$Class, pred_prob_test, optimal.pcut_w)
print(paste("Weighted Test Cost:", round(weighted_MR_test, 4)))
## [1] "Weighted Test Cost: 0.395"
Your observation: On the test set, the weighted cut-off of 0.22 produces a weighted cost of (5 × 5 + 1 × 54) / 200 = 0.395, higher than the training cost of 0.26. The confusion matrix shows that the model correctly classified 17 bad customers and 124 good customers, but misclassified 54 of the 71 bad customers (76%) as good and 5 good customers as bad.
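The weighted costs can be verified by hand from the two confusion matrices at cut-off 0.22. The sketch below reproduces them and, for comparison, scores the same predictions with the weight of 5 placed on bad customers predicted as good, which is the error the task statement describes as worse. The helper `weighted_cost` is hypothetical; the counts are copied from the output above.

```r
# Hypothetical helper: weighted cost from confusion-matrix counts.
weighted_cost <- function(FN, FP, n, weight_FN, weight_FP) {
  (weight_FN * FN + weight_FP * FP) / n
}

# Counts at cut-off 0.22: FN = good predicted bad, FP = bad predicted good.
train <- list(FN = 4, FP = 188, n = 800)
test  <- list(FN = 5, FP = 54,  n = 200)

# Weights as coded in costfunc_weighted (5 on FN, 1 on FP).
train_as_coded <- weighted_cost(train$FN, train$FP, train$n, 5, 1)  # 0.26
test_as_coded  <- weighted_cost(test$FN,  test$FP,  test$n,  5, 1)  # 0.395

# Same predictions scored with the weight of 5 on FP instead.
train_swapped <- weighted_cost(train$FN, train$FP, train$n, 1, 5)   # 1.18
test_swapped  <- weighted_cost(test$FN,  test$FP,  test$n,  1, 5)   # 1.375
```

Under the swapped weights the cut-off of 0.22 is expensive because it approves most bad customers, so a separate cut-off search would be needed for that weighting.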
Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model further.
Overall, the logistic regression model performed reasonably well in predicting customer creditworthiness. Under equal weights, the optimal cut-off was 0.39, with a training MR of 0.1962, test MR of 0.2750, training AUC of 0.8505, and test AUC of 0.7353. The model has useful predictive ability, but performance is clearly weaker on the test set, suggesting some overfitting; the cut-off was also tuned on the training data, which makes the training MR optimistic. With the 5:1 weights as coded, the optimal cut-off dropped to 0.22, giving a weighted cost of 0.26 on the training set and 0.395 on the test set. That weighting puts the cost of 5 on good customers predicted as bad, whereas the stated assumption puts it on bad customers predicted as good, so the weights should be swapped and the cut-off search rerun. To improve the model further, I would consider variable selection, checking multicollinearity, trying interaction terms, and comparing logistic regression with other classification methods.
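Comparing candidate models (fewer variables, interaction terms, other classifiers) calls for an error estimate that is less optimistic than the training MR. One common option, which the assignment does not prescribe and is shown here as an assumption, is k-fold cross-validation on the training data. The sketch below uses simulated stand-in data; the data frame `toy`, the predictors `x1` and `x2`, and the fixed 0.5 cut-off are all hypothetical.

```r
set.seed(2024)

# Simulated stand-in data (hypothetical): two predictors and a logical response.
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.5 + toy$x1 - 0.5 * toy$x2)) == 1

# Assign each row to one of k folds at random, with equal fold sizes.
k <- 5
fold <- sample(rep_len(seq_len(k), n))

# For each fold: fit on the other folds, then score the held-out fold.
cv_mr <- vapply(seq_len(k), function(j) {
  fit <- glm(y ~ x1 + x2, family = binomial, data = toy[fold != j, ])
  p <- predict(fit, newdata = toy[fold == j, ], type = "response")
  mean(toy$y[fold == j] != (p >= 0.5))
}, numeric(1))

mean(cv_mr)  # cross-validated misclassification rate at a 0.5 cut-off
```

The same loop could score a weighted cost instead of the plain misclassification rate, and the cut-off itself could be chosen inside each fold so that it is never tuned on the rows used to evaluate it.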