Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: 62 variables
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.
str(GermanCredit) # structure - see variable type
## 'data.frame': 1000 obs. of 49 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
summary(GermanCredit) # summary statistics
## Duration Amount InstallmentRatePercentage ResidenceDuration
## Min. : 4.0 Min. : 250 Min. :1.000 Min. :1.000
## 1st Qu.:12.0 1st Qu.: 1366 1st Qu.:2.000 1st Qu.:2.000
## Median :18.0 Median : 2320 Median :3.000 Median :3.000
## Mean :20.9 Mean : 3271 Mean :2.973 Mean :2.845
## 3rd Qu.:24.0 3rd Qu.: 3972 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :72.0 Max. :18424 Max. :4.000 Max. :4.000
## Age NumberExistingCredits NumberPeopleMaintenance Telephone
## Min. :19.00 Min. :1.000 Min. :1.000 Min. :0.000
## 1st Qu.:27.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.000
## Median :33.00 Median :1.000 Median :1.000 Median :1.000
## Mean :35.55 Mean :1.407 Mean :1.155 Mean :0.596
## 3rd Qu.:42.00 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :75.00 Max. :4.000 Max. :2.000 Max. :1.000
## ForeignWorker Class CheckingAccountStatus.lt.0
## Min. :0.000 Mode :logical Min. :0.000
## 1st Qu.:1.000 FALSE:300 1st Qu.:0.000
## Median :1.000 TRUE :700 Median :0.000
## Mean :0.963 Mean :0.274
## 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.000 Max. :1.000
## CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.269 Mean :0.063
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
## Min. :0.00 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000
## Median :0.00 Median :0.000
## Mean :0.04 Mean :0.049
## 3rd Qu.:0.00 3rd Qu.:0.000
## Max. :1.00 Max. :1.000
## CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar Purpose.UsedCar
## Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :1.00 Median :0.000 Median :0.000 Median :0.000
## Mean :0.53 Mean :0.088 Mean :0.234 Mean :0.103
## 3rd Qu.:1.00 3rd Qu.:0.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.00 Max. :1.000 Max. :1.000 Max. :1.000
## Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
## Min. :0.000 Min. :0.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0.000
## Mean :0.181 Mean :0.28 Mean :0.012
## 3rd Qu.:0.000 3rd Qu.:1.00 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.000
## Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.00 Median :0.000 Median :0.000
## Mean :0.022 Mean :0.05 Mean :0.009 Mean :0.097
## 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.00 Max. :1.000 Max. :1.000
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :1.000 Median :0.000
## Mean :0.603 Mean :0.103
## 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000
## Mean :0.063 Mean :0.048
## 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.172 Mean :0.339 Mean :0.174
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:0.00
## Median :0.000 Median :0.00
## Mean :0.253 Mean :0.05
## 3rd Qu.:1.000 3rd Qu.:0.00
## Max. :1.000 Max. :1.00
## Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:1.000
## Median :0.00 Median :1.000 Median :1.000
## Mean :0.31 Mean :0.548 Mean :0.907
## 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.00 Max. :1.000 Max. :1.000
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.041 Mean :0.282 Mean :0.232
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.000 Median :0.000 Median :0.000
## Mean :0.332 Mean :0.139 Mean :0.047
## 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.000
## Max. :1.000 Max. :1.000 Max. :1.000
## Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.0
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.0
## Median :0.000 Median :1.000 Median :0.000 Median :0.0
## Mean :0.179 Mean :0.713 Mean :0.022 Mean :0.2
## 3rd Qu.:0.000 3rd Qu.:1.000 3rd Qu.:0.000 3rd Qu.:0.0
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.0
## Job.SkilledEmployee
## Min. :0.00
## 1st Qu.:0.00
## Median :1.00
## Mean :0.63
## 3rd Qu.:1.00
## Max. :1.00
colSums(is.na(GermanCredit))
## Duration Amount
## 0 0
## InstallmentRatePercentage ResidenceDuration
## 0 0
## Age NumberExistingCredits
## 0 0
## NumberPeopleMaintenance Telephone
## 0 0
## ForeignWorker Class
## 0 0
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 0 0
## CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 0 0
## CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly
## 0 0
## CreditHistory.Delay Purpose.NewCar
## 0 0
## Purpose.UsedCar Purpose.Furniture.Equipment
## 0 0
## Purpose.Radio.Television Purpose.DomesticAppliance
## 0 0
## Purpose.Repairs Purpose.Education
## 0 0
## Purpose.Retraining Purpose.Business
## 0 0
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 0 0
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 0 0
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4
## 0 0
## EmploymentDuration.4.to.7 EmploymentDuration.gt.7
## 0 0
## Personal.Male.Divorced.Seperated Personal.Female.NotSingle
## 0 0
## Personal.Male.Single OtherDebtorsGuarantors.None
## 0 0
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate
## 0 0
## Property.Insurance Property.CarOther
## 0 0
## OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 0 0
## Housing.Rent Housing.Own
## 0 0
## Job.UnemployedUnskilled Job.UnskilledResident
## 0 0
## Job.SkilledEmployee
## 0
Your observation: after removing the variables with no data there are 49 variables with no missing data. id say there are too many variables with little information but we will see if they are significant once we conduct some tests.
2024 for reproducibility.
(10pts)set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
credit_train = GermanCredit[index,]
credit_test = GermanCredit[-index,]
Your observation: split 80-20
glm_credit<- glm(Class~., family=binomial, data=credit_train)
Your observation:
summary(glm_credit)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = credit_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.241e+00 1.719e+00 5.376 7.61e-08 ***
## Duration -2.994e-02 1.072e-02 -2.794 0.005214 **
## Amount -1.771e-04 5.095e-05 -3.475 0.000510 ***
## InstallmentRatePercentage -3.718e-01 1.036e-01 -3.589 0.000332 ***
## ResidenceDuration 2.577e-02 1.010e-01 0.255 0.798510
## Age 1.183e-02 1.097e-02 1.078 0.280974
## NumberExistingCredits -1.225e-01 2.189e-01 -0.560 0.575690
## NumberPeopleMaintenance -1.731e-01 2.945e-01 -0.588 0.556678
## Telephone -4.236e-01 2.371e-01 -1.786 0.074081 .
## ForeignWorker -1.651e+00 7.421e-01 -2.224 0.026143 *
## CheckingAccountStatus.lt.0 -1.817e+00 2.710e-01 -6.703 2.04e-11 ***
## CheckingAccountStatus.0.to.200 -1.432e+00 2.686e-01 -5.330 9.81e-08 ***
## CheckingAccountStatus.gt.200 -5.912e-01 4.631e-01 -1.277 0.201696
## CreditHistory.NoCredit.AllPaid -8.724e-01 5.139e-01 -1.698 0.089584 .
## CreditHistory.ThisBank.AllPaid -1.676e+00 5.493e-01 -3.052 0.002277 **
## CreditHistory.PaidDuly -6.686e-01 2.939e-01 -2.275 0.022899 *
## CreditHistory.Delay -9.413e-01 3.780e-01 -2.491 0.012756 *
## Purpose.NewCar -1.733e+00 1.013e+00 -1.710 0.087282 .
## Purpose.UsedCar 6.716e-02 1.033e+00 0.065 0.948146
## Purpose.Furniture.Equipment -8.257e-01 1.015e+00 -0.814 0.415816
## Purpose.Radio.Television -8.386e-01 1.019e+00 -0.823 0.410457
## Purpose.DomesticAppliance -1.227e+00 1.328e+00 -0.923 0.355762
## Purpose.Repairs -1.321e+00 1.165e+00 -1.134 0.256825
## Purpose.Education -2.020e+00 1.088e+00 -1.857 0.063374 .
## Purpose.Retraining 4.276e-01 1.640e+00 0.261 0.794237
## Purpose.Business -8.618e-01 1.032e+00 -0.835 0.403529
## SavingsAccountBonds.lt.100 -1.266e+00 3.201e-01 -3.956 7.63e-05 ***
## SavingsAccountBonds.100.to.500 -1.075e+00 4.171e-01 -2.577 0.009964 **
## SavingsAccountBonds.500.to.1000 -8.768e-01 5.216e-01 -1.681 0.092761 .
## SavingsAccountBonds.gt.1000 1.301e-02 6.161e-01 0.021 0.983157
## EmploymentDuration.lt.1 3.581e-01 5.167e-01 0.693 0.488195
## EmploymentDuration.1.to.4 5.527e-01 5.000e-01 1.105 0.268967
## EmploymentDuration.4.to.7 9.863e-01 5.355e-01 1.842 0.065524 .
## EmploymentDuration.gt.7 5.253e-01 5.039e-01 1.042 0.297218
## Personal.Male.Divorced.Seperated -2.546e-01 5.214e-01 -0.488 0.625274
## Personal.Female.NotSingle -1.274e-01 3.573e-01 -0.357 0.721452
## Personal.Male.Single 4.118e-01 3.623e-01 1.137 0.255622
## OtherDebtorsGuarantors.None -1.239e+00 5.370e-01 -2.308 0.021018 *
## OtherDebtorsGuarantors.CoApplicant -1.565e+00 6.828e-01 -2.292 0.021919 *
## Property.RealEstate 7.166e-01 4.898e-01 1.463 0.143477
## Property.Insurance 3.544e-01 4.785e-01 0.741 0.458926
## Property.CarOther 6.110e-01 4.648e-01 1.314 0.188702
## OtherInstallmentPlans.Bank -8.504e-01 2.730e-01 -3.115 0.001838 **
## OtherInstallmentPlans.Stores -4.293e-01 4.711e-01 -0.911 0.362139
## Housing.Rent -9.538e-01 5.624e-01 -1.696 0.089924 .
## Housing.Own -2.723e-01 5.282e-01 -0.516 0.606157
## Job.UnemployedUnskilled 1.449e+00 8.788e-01 1.649 0.099175 .
## Job.UnskilledResident -2.641e-03 4.101e-01 -0.006 0.994861
## Job.SkilledEmployee -1.073e-02 3.349e-01 -0.032 0.974438
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 958.02 on 799 degrees of freedom
## Residual deviance: 672.78 on 751 degrees of freedom
## AIC: 770.78
##
## Number of Fisher Scoring iterations: 5
Your observation: there seem to be many variables that are significant such as duration, amount, installment rate, and having low checking balance. these variables tend to increase the chance of becoming a credit risk
train_probs <- predict(glm_credit, newdata = credit_train, type = "response")
Your observation:
costfunc <- function(obs, pred.p, pcut) {
weight_FN = 1 # Define the weight for "true=1 but pred=0" (FN)
weight_FP = 1 # Define the weight for "true=0 but pred=1" (FP)
FNC <- sum((obs == TRUE) & (pred.p < pcut), na.rm = TRUE)
FPC <- sum((obs == FALSE) & (pred.p >= pcut), na.rm = TRUE)
MR <- sum(weight_FN * FNC + weight_FP * FPC) / length(obs)
if (is.na(MR) || !is.finite(MR)) {
MR <- 1
}
return(MR)
}
# Define a sequence of cutoff values from 0.01 to 1, by 0.01
pcut.seq <- seq(0.01, 1, by = 0.01)
MR_vec <- rep(0, length(pcut.seq))
for(i in 1:length(pcut.seq)) {
MR_vec[i] <- costfunc(obs = credit_train$class, pred.p = credit_train, pcut = pcut.seq[i])
}
cbind(pcut.seq, MR_vec)
## pcut.seq MR_vec
## [1,] 0.01 1
## [2,] 0.02 1
## [3,] 0.03 1
## [4,] 0.04 1
## [5,] 0.05 1
## [6,] 0.06 1
## [7,] 0.07 1
## [8,] 0.08 1
## [9,] 0.09 1
## [10,] 0.10 1
## [11,] 0.11 1
## [12,] 0.12 1
## [13,] 0.13 1
## [14,] 0.14 1
## [15,] 0.15 1
## [16,] 0.16 1
## [17,] 0.17 1
## [18,] 0.18 1
## [19,] 0.19 1
## [20,] 0.20 1
## [21,] 0.21 1
## [22,] 0.22 1
## [23,] 0.23 1
## [24,] 0.24 1
## [25,] 0.25 1
## [26,] 0.26 1
## [27,] 0.27 1
## [28,] 0.28 1
## [29,] 0.29 1
## [30,] 0.30 1
## [31,] 0.31 1
## [32,] 0.32 1
## [33,] 0.33 1
## [34,] 0.34 1
## [35,] 0.35 1
## [36,] 0.36 1
## [37,] 0.37 1
## [38,] 0.38 1
## [39,] 0.39 1
## [40,] 0.40 1
## [41,] 0.41 1
## [42,] 0.42 1
## [43,] 0.43 1
## [44,] 0.44 1
## [45,] 0.45 1
## [46,] 0.46 1
## [47,] 0.47 1
## [48,] 0.48 1
## [49,] 0.49 1
## [50,] 0.50 1
## [51,] 0.51 1
## [52,] 0.52 1
## [53,] 0.53 1
## [54,] 0.54 1
## [55,] 0.55 1
## [56,] 0.56 1
## [57,] 0.57 1
## [58,] 0.58 1
## [59,] 0.59 1
## [60,] 0.60 1
## [61,] 0.61 1
## [62,] 0.62 1
## [63,] 0.63 1
## [64,] 0.64 1
## [65,] 0.65 1
## [66,] 0.66 1
## [67,] 0.67 1
## [68,] 0.68 1
## [69,] 0.69 1
## [70,] 0.70 1
## [71,] 0.71 1
## [72,] 0.72 1
## [73,] 0.73 1
## [74,] 0.74 1
## [75,] 0.75 1
## [76,] 0.76 1
## [77,] 0.77 1
## [78,] 0.78 1
## [79,] 0.79 1
## [80,] 0.80 1
## [81,] 0.81 1
## [82,] 0.82 1
## [83,] 0.83 1
## [84,] 0.84 1
## [85,] 0.85 1
## [86,] 0.86 1
## [87,] 0.87 1
## [88,] 0.88 1
## [89,] 0.89 1
## [90,] 0.90 1
## [91,] 0.91 1
## [92,] 0.92 1
## [93,] 0.93 1
## [94,] 0.94 1
## [95,] 0.95 1
## [96,] 0.96 1
## [97,] 0.97 1
## [98,] 0.98 1
## [99,] 0.99 1
## [100,] 1.00 1
plot(pcut.seq, MR_vec, type = "b", main = "Misclassification Rate by Cutoff",
xlab = "Cutoff (pcut)", ylab = "Misclassification Rate (MR)")
# Find the optimal cutoff with the minimum misclassification rate
optimal.pcut <- pcut.seq[which.min(MR_vec)]
print(optimal.pcut)
## [1] 0.01
Your observation: i got a 0.1 cutoff which means we only need 10% prob for an observation to be positive
pred_prob_credit_train <- predict(glm_credit, newdata = credit_train, type = "response")
class.glm_opt.train <- ifelse(pred_prob_credit_train >= 0.1, TRUE, FALSE)
table(credit_train$Class, class.glm_opt.train, dnn = c("True", "Predicted"))
## Predicted
## True FALSE TRUE
## FALSE 16 213
## TRUE 1 570
confusion_matrix_train <- table(credit_train$Class, class.glm_opt.train, dnn = c("True", "Predicted"))
print(confusion_matrix_train)
## Predicted
## True FALSE TRUE
## FALSE 16 213
## TRUE 1 570
MR_train <- sum(confusion_matrix_train[1, 2] + confusion_matrix_train[2, 1]) / sum(confusion_matrix_train)
print(paste("Misclassification Rate (MR) for Training Set:", MR_train))
## [1] "Misclassification Rate (MR) for Training Set: 0.2675"
Your observation: an mr of 0.2675 means that 26.75% of the predictions were wrong. so 73.25% accuracy thats not bad but it could be higher
library(ROCR)
pred_train <- prediction(pred_prob_credit_train, credit_train$Class)
ROC <- performance(pred_train, "tpr", "fpr")
plot(ROC, colorize = TRUE, main = "ROC")
auc_train <- unlist(slot(performance(pred_train, "auc"), "y.values"))
print(paste("AUC", auc_train))
## [1] "AUC 0.850480655251264"
Your observation: with an AUC of 0.8504 that means that the model predicts the risk classes 85% correctly which is effective. the roc was around 73% and AUC was 85% so i think the model could slightly improve
pred_prob_credit_test <- predict(glm_credit, newdata = credit_test, type = "response")
pred_class_credit_test_optimal <- ifelse(pred_prob_credit_test > 0.1, TRUE, FALSE)
conf_test <- table(credit_test$Class, pred_class_credit_test_optimal, dnn = c("True", "Predicted"))
print(conf_test)
## Predicted
## True FALSE TRUE
## FALSE 6 65
## TRUE 3 126
MR_test <- 1 - sum(diag(conf_test)) / sum(conf_test)
print(paste("Misclassification Rate (MR) for Test Set:", MR_test))
## [1] "Misclassification Rate (MR) for Test Set: 0.34"
Your observation: MR meas a 34% incorrect rate which is a increase from the training set it seems that there may be some over fitting in the model
pred_test <- prediction(pred_prob_credit_test, credit_test$Class)
ROC_test <- performance(pred_test, "tpr", "fpr")
plot(ROC_test, colorize = TRUE, main = "ROC test")
auc_test <- unlist(slot(performance(pred_test, "auc"), "y.values"))
print(paste("AUC test", auc_test))
## [1] "AUC test 0.735342286275794"
Your observation: the AUC is 73.53% which is still acceptable but is lower that the training which means that the model could be improved.
Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!
# Redefine the cost function with updated weights (more weight on FN)
costfunc <- function(obs, pred.p, pcut) {
weight_FN = 5 # Higher weight for "true=1 but pred=0" (FN)
weight_FP = 1 # Lower weight for "true=0 but pred=1" (FP)
# Count FN and FP based on the cutoff
FNC <- sum((obs == 1) & (pred.p < pcut)) # FN count
FPC <- sum((obs == 0) & (pred.p >= pcut)) # FP count
# Weighted misclassification rate
MR <- sum(weight_FN * FNC + weight_FP * FPC) / length(obs)
return(MR)
}
# Define a sequence of cutoff values from 0.01 to 1, in steps of 0.01
pcut.seq <- seq(0.01, 1, by = 0.01)
# Calculate the weighted misclassification rate (MR) for each cutoff
MR_vec <- sapply(pcut.seq, function(cutoff) {
costfunc(obs = credit_train$Class, pred.p = pred_prob_credit_train, pcut = cutoff)
})
# Find the optimal cutoff with the minimum weighted misclassification rate
optimal_cutoff <- pcut.seq[which.min(MR_vec)]
print(paste("Optimal Cutoff with Weights:", optimal_cutoff))
## [1] "Optimal Cutoff with Weights: 0.22"
Your observation: with a weight change the model now puts more value on making a false negative as its more costly this better prevent false negatives but can lead to more false positives
pred_class_credit_train_optimal <- ifelse(pred_prob_credit_train > 0.22, TRUE, FALSE)
conf_train <- table(credit_train$Class, pred_class_credit_train_optimal, dnn = c("True", "Predicted"))
print(conf_train)
## Predicted
## True FALSE TRUE
## FALSE 41 188
## TRUE 4 567
MR_train <- costfunc(obs = credit_train$Class, pred.p = pred_prob_credit_train, pcut = 0.22)
print(paste("Weighted MR for Training Set:", MR_train))
## [1] "Weighted MR for Training Set: 0.26"
Your observation: the new MR 0.26 is almost the same but it slightly increased that means the model has slightly stricter conditions of a non credit risk
pred_prob_credit_test <- predict(glm_credit, newdata = credit_test, type = "response")
pred_class_credit_test_optimal <- ifelse(pred_prob_credit_test > 0.22, TRUE, FALSE)
conf_test <- table(credit_test$Class, pred_class_credit_test_optimal, dnn = c("True", "Predicted"))
print(conf_test)
## Predicted
## True FALSE TRUE
## FALSE 17 54
## TRUE 5 124
MR_test <- costfunc(obs = credit_test$Class, pred.p = pred_prob_credit_test, pcut = 0.22)
print(paste("Weighted MR for Test Set:", MR_test))
## [1] "Weighted MR for Test Set: 0.395"
Your observation: the new MR 0.395 increased from 0.34 which means its trying to prevent more false negatives but it leads to a higher MR
Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model.
In this analysis, we aimed to develop a model that balances accuracy with the cost of misclassifications, prioritizing false negatives (misclassifying a bad customer as good) as five times more costly than false positives. With equal weights on false positives and false negatives. the optimal probability cutoff was 0.1, with an MR of 0.2675 and a test MR of 0.34, with AUCs of 0.8504 training and 0.7353 test, indicating good discrimination ability on both.
With the weighted cost structure, the optimal cutoff shifted to 0.22, prioritizing caution in classifying customers as good. This resulted in a weighted MR of 0.26 for the training set and 0.395 for the test set, showing a trade-off where the model became more conservative but less generalizable. To improve the model we need keep it conservative but make it more accurate maybe by removing some of the less significant variables and more testing