Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation: The dataset has 1,000 rows and 62 variables, mostly numeric or binary. The target variable Class shows whether a customer has good credit (TRUE) or not (FALSE).
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.
colSums(is.na(GermanCredit))
## Duration Amount
## 0 0
## InstallmentRatePercentage ResidenceDuration
## 0 0
## Age NumberExistingCredits
## 0 0
## NumberPeopleMaintenance Telephone
## 0 0
## ForeignWorker Class
## 0 0
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 0 0
## CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 0 0
## CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly
## 0 0
## CreditHistory.Delay Purpose.NewCar
## 0 0
## Purpose.UsedCar Purpose.Furniture.Equipment
## 0 0
## Purpose.Radio.Television Purpose.DomesticAppliance
## 0 0
## Purpose.Repairs Purpose.Education
## 0 0
## Purpose.Retraining Purpose.Business
## 0 0
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 0 0
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 0 0
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4
## 0 0
## EmploymentDuration.4.to.7 EmploymentDuration.gt.7
## 0 0
## Personal.Male.Divorced.Seperated Personal.Female.NotSingle
## 0 0
## Personal.Male.Single OtherDebtorsGuarantors.None
## 0 0
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate
## 0 0
## Property.Insurance Property.CarOther
## 0 0
## OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 0 0
## Housing.Rent Housing.Own
## 0 0
## Job.UnemployedUnskilled Job.UnskilledResident
## 0 0
## Job.SkilledEmployee
## 0
sapply(GermanCredit, class)
## Duration Amount
## "integer" "integer"
## InstallmentRatePercentage ResidenceDuration
## "integer" "integer"
## Age NumberExistingCredits
## "integer" "integer"
## NumberPeopleMaintenance Telephone
## "integer" "numeric"
## ForeignWorker Class
## "numeric" "logical"
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## "numeric" "numeric"
## CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## "numeric" "numeric"
## CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly
## "numeric" "numeric"
## CreditHistory.Delay Purpose.NewCar
## "numeric" "numeric"
## Purpose.UsedCar Purpose.Furniture.Equipment
## "numeric" "numeric"
## Purpose.Radio.Television Purpose.DomesticAppliance
## "numeric" "numeric"
## Purpose.Repairs Purpose.Education
## "numeric" "numeric"
## Purpose.Retraining Purpose.Business
## "numeric" "numeric"
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## "numeric" "numeric"
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## "numeric" "numeric"
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4
## "numeric" "numeric"
## EmploymentDuration.4.to.7 EmploymentDuration.gt.7
## "numeric" "numeric"
## Personal.Male.Divorced.Seperated Personal.Female.NotSingle
## "numeric" "numeric"
## Personal.Male.Single OtherDebtorsGuarantors.None
## "numeric" "numeric"
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate
## "numeric" "numeric"
## Property.Insurance Property.CarOther
## "numeric" "numeric"
## OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## "numeric" "numeric"
## Housing.Rent Housing.Own
## "numeric" "numeric"
## Job.UnemployedUnskilled Job.UnskilledResident
## "numeric" "numeric"
## Job.SkilledEmployee
## "numeric"
Your observation: There are no missing values in the dataset. Most variables are numeric, while Class is logical, meaning the data is clean and ready for modeling.
2024 for reproducibility.
(10pts)set.seed(2024)
trainIndex <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
train <- GermanCredit[trainIndex, ]
test <- GermanCredit[-trainIndex, ]
#quick check
prop.table(table(train$Class))
##
## FALSE TRUE
## 0.3 0.7
prop.table(table(test$Class))
##
## FALSE TRUE
## 0.3 0.7
Your observation: The data was split into 70% training and 30% testing, keeping the same class balance of about 30% bad and 70% good credit cases.
train$Class <- factor(train$Class, levels = c("FALSE","TRUE"))
test$Class <- factor(test$Class, levels = c("FALSE","TRUE"))
lrm_all <- glm(Class ~ ., data = train, family = binomial(link = "logit"))
summary(lrm_all)
##
## Call:
## glm(formula = Class ~ ., family = binomial(link = "logit"), data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.7755921 1.7594925 5.556 2.76e-08 ***
## Duration -0.0281752 0.0114559 -2.459 0.013915 *
## Amount -0.0001968 0.0000580 -3.394 0.000690 ***
## InstallmentRatePercentage -0.3458012 0.1122102 -3.082 0.002058 **
## ResidenceDuration -0.1477247 0.1099835 -1.343 0.179222
## Age -0.0011930 0.0111092 -0.107 0.914479
## NumberExistingCredits -0.1741853 0.2247245 -0.775 0.438277
## NumberPeopleMaintenance -0.2953842 0.3033517 -0.974 0.330188
## Telephone -0.8357009 0.2619015 -3.191 0.001418 **
## ForeignWorker -1.6606566 0.8122576 -2.044 0.040905 *
## CheckingAccountStatus.lt.0 -2.0280291 0.2899845 -6.994 2.68e-12 ***
## CheckingAccountStatus.0.to.200 -1.4706478 0.2943908 -4.996 5.87e-07 ***
## CheckingAccountStatus.gt.200 -0.6052653 0.4876931 -1.241 0.214577
## CreditHistory.NoCredit.AllPaid -1.2639798 0.5155113 -2.452 0.014211 *
## CreditHistory.ThisBank.AllPaid -1.8780235 0.5706646 -3.291 0.000999 ***
## CreditHistory.PaidDuly -0.8775997 0.3159046 -2.778 0.005469 **
## CreditHistory.Delay -0.4012640 0.4307837 -0.931 0.351608
## Purpose.NewCar -1.0626620 0.8142904 -1.305 0.191887
## Purpose.UsedCar 1.1942539 0.8839916 1.351 0.176702
## Purpose.Furniture.Equipment -0.1681192 0.8320966 -0.202 0.839883
## Purpose.Radio.Television -0.3031554 0.8286036 -0.366 0.714467
## Purpose.DomesticAppliance -0.7371787 1.2321421 -0.598 0.549646
## Purpose.Repairs -0.8575710 0.9887784 -0.867 0.385776
## Purpose.Education -0.6848705 0.9364025 -0.731 0.464544
## Purpose.Retraining -0.1649183 1.5465838 -0.107 0.915079
## Purpose.Business -0.3600823 0.8535288 -0.422 0.673116
## SavingsAccountBonds.lt.100 -0.9786195 0.3127225 -3.129 0.001752 **
## SavingsAccountBonds.100.to.500 -0.9669534 0.4406228 -2.195 0.028198 *
## SavingsAccountBonds.500.to.1000 -0.2529878 0.5442721 -0.465 0.642061
## SavingsAccountBonds.gt.1000 0.2713176 0.6594268 0.411 0.680747
## EmploymentDuration.lt.1 -0.4435735 0.5345880 -0.830 0.406681
## EmploymentDuration.1.to.4 -0.4275141 0.5069023 -0.843 0.399013
## EmploymentDuration.4.to.7 0.4416798 0.5618787 0.786 0.431822
## EmploymentDuration.gt.7 -0.2520532 0.5037635 -0.500 0.616835
## Personal.Male.Divorced.Seperated -0.4301280 0.5538492 -0.777 0.437385
## Personal.Female.NotSingle -0.0179029 0.3950224 -0.045 0.963851
## Personal.Male.Single 0.6299901 0.3971902 1.586 0.112713
## OtherDebtorsGuarantors.None -1.0309812 0.5142560 -2.005 0.044984 *
## OtherDebtorsGuarantors.CoApplicant -1.0727811 0.7201303 -1.490 0.136302
## Property.RealEstate 1.2295999 0.5185315 2.371 0.017725 *
## Property.Insurance 0.8935212 0.5097800 1.753 0.079643 .
## Property.CarOther 1.1356001 0.5048681 2.249 0.024493 *
## OtherInstallmentPlans.Bank -0.6436463 0.3046547 -2.113 0.034626 *
## OtherInstallmentPlans.Stores -0.2405278 0.4731218 -0.508 0.611184
## Housing.Rent -0.7041915 0.5817432 -1.210 0.226093
## Housing.Own -0.5109041 0.5552490 -0.920 0.357502
## Job.UnemployedUnskilled 0.4681174 0.8091298 0.579 0.562897
## Job.UnskilledResident 0.3450109 0.4498926 0.767 0.443156
## Job.SkilledEmployee 0.1604813 0.3719210 0.431 0.666110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 855.21 on 699 degrees of freedom
## Residual deviance: 595.77 on 651 degrees of freedom
## AIC: 693.77
##
## Number of Fisher Scoring iterations: 5
Your observation: The logistic regression model fits well with several significant predictors. Variables like Amount, Duration, CheckingAccountStatus, and Telephone have strong effects on credit risk, while some variables were excluded due to multicollinearity.
coefs <- summary(lrm_all)$coefficients
odds_ratios <- exp(coef(lrm_all))
cbind(coefs, OddsRatio = odds_ratios)
## Estimate Std. Error z value
## (Intercept) 9.7755920697 1.759492e+00 5.55591574
## Duration -0.0281751864 1.145588e-02 -2.45945116
## Amount -0.0001968268 5.799757e-05 -3.39370661
## InstallmentRatePercentage -0.3458012342 1.122102e-01 -3.08172646
## ResidenceDuration -0.1477246581 1.099835e-01 -1.34315321
## Age -0.0011930223 1.110923e-02 -0.10739018
## NumberExistingCredits -0.1741853072 2.247245e-01 -0.77510593
## NumberPeopleMaintenance -0.2953842316 3.033517e-01 -0.97373507
## Telephone -0.8357008957 2.619015e-01 -3.19089796
## ForeignWorker -1.6606565630 8.122576e-01 -2.04449491
## CheckingAccountStatus.lt.0 -2.0280291351 2.899845e-01 -6.99357666
## CheckingAccountStatus.0.to.200 -1.4706478442 2.943908e-01 -4.99556266
## CheckingAccountStatus.gt.200 -0.6052653383 4.876931e-01 -1.24107823
## CreditHistory.NoCredit.AllPaid -1.2639798373 5.155113e-01 -2.45189546
## CreditHistory.ThisBank.AllPaid -1.8780234743 5.706646e-01 -3.29094092
## CreditHistory.PaidDuly -0.8775997275 3.159046e-01 -2.77805295
## CreditHistory.Delay -0.4012639884 4.307837e-01 -0.93147437
## Purpose.NewCar -1.0626619618 8.142904e-01 -1.30501598
## Purpose.UsedCar 1.1942539313 8.839916e-01 1.35097883
## Purpose.Furniture.Equipment -0.1681192131 8.320966e-01 -0.20204290
## Purpose.Radio.Television -0.3031553848 8.286036e-01 -0.36586297
## Purpose.DomesticAppliance -0.7371786695 1.232142e+00 -0.59829029
## Purpose.Repairs -0.8575710181 9.887784e-01 -0.86730357
## Purpose.Education -0.6848705104 9.364025e-01 -0.73138471
## Purpose.Retraining -0.1649182660 1.546584e+00 -0.10663390
## Purpose.Business -0.3600823059 8.535288e-01 -0.42187482
## SavingsAccountBonds.lt.100 -0.9786194558 3.127225e-01 -3.12935373
## SavingsAccountBonds.100.to.500 -0.9669533652 4.406228e-01 -2.19451492
## SavingsAccountBonds.500.to.1000 -0.2529877648 5.442721e-01 -0.46481853
## SavingsAccountBonds.gt.1000 0.2713175773 6.594268e-01 0.41144460
## EmploymentDuration.lt.1 -0.4435735258 5.345880e-01 -0.82974836
## EmploymentDuration.1.to.4 -0.4275141090 5.069024e-01 -0.84338553
## EmploymentDuration.4.to.7 0.4416798369 5.618787e-01 0.78607684
## EmploymentDuration.gt.7 -0.2520532431 5.037635e-01 -0.50034043
## Personal.Male.Divorced.Seperated -0.4301280237 5.538492e-01 -0.77661578
## Personal.Female.NotSingle -0.0179028535 3.950224e-01 -0.04532111
## Personal.Male.Single 0.6299900791 3.971902e-01 1.58611688
## OtherDebtorsGuarantors.None -1.0309812406 5.142559e-01 -2.00480178
## OtherDebtorsGuarantors.CoApplicant -1.0727811145 7.201303e-01 -1.48970425
## Property.RealEstate 1.2295998649 5.185315e-01 2.37131168
## Property.Insurance 0.8935212291 5.097800e-01 1.75275860
## Property.CarOther 1.1356000792 5.048681e-01 2.24930068
## OtherInstallmentPlans.Bank -0.6436463267 3.046547e-01 -2.11270803
## OtherInstallmentPlans.Stores -0.2405277566 4.731218e-01 -0.50838439
## Housing.Rent -0.7041915492 5.817432e-01 -1.21048523
## Housing.Own -0.5109041125 5.552490e-01 -0.92013510
## Job.UnemployedUnskilled 0.4681173901 8.091298e-01 0.57854424
## Job.UnskilledResident 0.3450109181 4.498926e-01 0.76687392
## Job.SkilledEmployee 0.1604813103 3.719210e-01 0.43149304
## Pr(>|z|) OddsRatio
## (Intercept) 2.761601e-08 1.759891e+04
## Duration 1.391496e-02 9.722180e-01
## Amount 6.895352e-04 9.998032e-01
## InstallmentRatePercentage 2.058039e-03 7.076531e-01
## ResidenceDuration 1.792224e-01 8.626686e-01
## Age 9.144794e-01 9.988077e-01
## NumberExistingCredits 4.382771e-01 8.401412e-01
## NumberPeopleMaintenance 3.301881e-01 7.442456e-01
## Telephone 1.418314e-03 4.335705e-01
## ForeignWorker 4.090468e-02 1.900142e-01
## CheckingAccountStatus.lt.0 2.679653e-12 1.315946e-01
## CheckingAccountStatus.0.to.200 5.866447e-07 2.297766e-01
## CheckingAccountStatus.gt.200 2.145769e-01 5.459296e-01
## CreditHistory.NoCredit.AllPaid 1.421059e-02 2.825274e-01
## CreditHistory.ThisBank.AllPaid 9.985288e-04 1.528920e-01
## CreditHistory.PaidDuly 5.468571e-03 4.157797e-01
## CreditHistory.Delay 3.516082e-01 6.694733e-01
## Purpose.NewCar 1.918874e-01 3.455348e-01
## Purpose.UsedCar 1.767022e-01 3.301094e+00
## Purpose.Furniture.Equipment 8.398832e-01 8.452531e-01
## Purpose.Radio.Television 7.144673e-01 7.384843e-01
## Purpose.DomesticAppliance 5.496463e-01 4.784619e-01
## Purpose.Repairs 3.857757e-01 4.241912e-01
## Purpose.Education 4.645442e-01 5.041555e-01
## Purpose.Retraining 9.150794e-01 8.479630e-01
## Purpose.Business 6.731164e-01 6.976189e-01
## SavingsAccountBonds.lt.100 1.751913e-03 3.758296e-01
## SavingsAccountBonds.100.to.500 2.819841e-02 3.802397e-01
## SavingsAccountBonds.500.to.1000 6.420614e-01 7.764774e-01
## SavingsAccountBonds.gt.1000 6.807466e-01 1.311692e+00
## EmploymentDuration.lt.1 4.066811e-01 6.417390e-01
## EmploymentDuration.1.to.4 3.990129e-01 6.521282e-01
## EmploymentDuration.4.to.7 4.318225e-01 1.555318e+00
## EmploymentDuration.gt.7 6.168354e-01 7.772034e-01
## Personal.Male.Divorced.Seperated 4.373855e-01 6.504258e-01
## Personal.Female.NotSingle 9.638514e-01 9.822565e-01
## Personal.Male.Single 1.127128e-01 1.877592e+00
## OtherDebtorsGuarantors.None 4.498424e-02 3.566568e-01
## OtherDebtorsGuarantors.CoApplicant 1.363020e-01 3.420559e-01
## Property.RealEstate 1.772508e-02 3.419861e+00
## Property.Insurance 7.964345e-02 2.443719e+00
## Property.CarOther 2.449337e-02 3.113041e+00
## OtherInstallmentPlans.Bank 3.462577e-02 5.253732e-01
## OtherInstallmentPlans.Stores 6.111838e-01 7.862128e-01
## Housing.Rent 2.260928e-01 4.945082e-01
## Housing.Own 3.575022e-01 5.999529e-01
## Job.UnemployedUnskilled 5.628967e-01 1.596985e+00
## Job.UnskilledResident 4.431565e-01 1.412005e+00
## Job.SkilledEmployee 6.661099e-01 1.174076e+00
Your observation: Important predictors include CheckingAccountStatus, Amount, and CreditHistory, which strongly affect credit risk. For example, lower account status greatly reduces the odds of being a good credit risk.
train$prob <- predict(lrm_all, newdata = train, type = "response")
Your observation:
get_MR <- function(threshold, obs, prob) {
pred <- ifelse(prob >= threshold, "TRUE", "FALSE")
cm <- table(Predicted = pred, Actual = obs)
#misclassification rate
MR <- 1 - sum(diag(cm)) / sum(cm)
return(list(MR = MR, cm = cm))
}
thresholds <- seq(0, 1, by = 0.001)
mr_vals <- sapply(thresholds, function(t) get_MR(t, train$Class, train$prob)$MR)
best_idx <- which.min(mr_vals)
best_threshold_equal <- thresholds[best_idx]
best_threshold_equal
## [1] 0.411
min(mr_vals)
## [1] 0.19
# Show confusion matrix at best threshold
best_result_equal <- get_MR(best_threshold_equal, train$Class, train$prob)
best_result_equal$cm
## Actual
## Predicted FALSE TRUE
## FALSE 104 27
## TRUE 106 463
Your observation: The optimal probability cutoff is around 0.41, giving a misclassification rate of about 19%. The model correctly classifies most good credit cases but misses some bad ones.
train_pred_equal <- factor(ifelse(train$prob >= best_threshold_equal, "TRUE", "FALSE"), levels = c("FALSE","TRUE"))
conf_train_equal <- confusionMatrix(train_pred_equal, train$Class, positive = "TRUE")
conf_train_equal$table
## Reference
## Prediction FALSE TRUE
## FALSE 104 27
## TRUE 106 463
conf_train_equal$overall["Accuracy"]
## Accuracy
## 0.81
# MR = 1 - Accuracy
MR_train_equal <- 1 - conf_train_equal$overall["Accuracy"]
MR_train_equal
## Accuracy
## 0.19
Your observation: The model achieved about 81% accuracy and a 19% misclassification rate on the training set, showing a fairly good fit.
if (!require(pROC, quietly = TRUE)) {
install.packages("pROC")
}
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(pROC)
roc_train <- roc(response = as.numeric(train$Class == "TRUE"), predictor = train$prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_train <- auc(roc_train)
auc_train
## Area under the curve: 0.8497
Your observation: The model’s AUC is 0.85, showing good ability to distinguish between good and bad credit applicants.
test$prob <- predict(lrm_all, newdata = test, type = "response")
test_pred_equal <- factor(ifelse(test$prob >= best_threshold_equal, "TRUE", "FALSE"), levels = c("FALSE","TRUE"))
conf_test_equal <- confusionMatrix(test_pred_equal, test$Class, positive = "TRUE")
conf_test_equal$table
## Reference
## Prediction FALSE TRUE
## FALSE 30 22
## TRUE 60 188
MR_test_equal <- 1 - conf_test_equal$overall["Accuracy"]
MR_test_equal
## Accuracy
## 0.2733333
Your observation: The model’s accuracy dropped to 0.27, meaning it’s now performing poorly at correctly predicting outcomes.
roc_test <- roc(response = as.numeric(test$Class == "TRUE"), predictor = test$prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_test <- auc(roc_test)
auc_test
## Area under the curve: 0.7562
Your observation: The model’s AUC is 0.76, showing a moderate ability to distinguish between good and bad credit cases.
Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!
weight_FP <- 5 # cost of predicted = TRUE when actual = FALSE
weight_FN <- 1 # cost of predicted = FALSE when actual = TRUE
get_weighted_cost <- function(threshold, obs, prob, wFP, wFN) {
pred <- ifelse(prob >= threshold, "TRUE", "FALSE")
# counts:
TP <- sum(pred == "TRUE" & obs == "TRUE")
TN <- sum(pred == "FALSE" & obs == "FALSE")
FP <- sum(pred == "TRUE" & obs == "FALSE")
FN <- sum(pred == "FALSE" & obs == "TRUE")
# weighted cost (normalized by n to be comparable to MR)
wcost <- (wFP * FP + wFN * FN) / (TP + TN + FP + FN)
return(list(wcost = wcost, conf = c(TP = TP, TN = TN, FP = FP, FN = FN)))
}
wcosts <- sapply(thresholds, function(t) get_weighted_cost(t, train$Class, train$prob, weight_FP, weight_FN)$wcost)
best_idx_w <- which.min(wcosts)
best_threshold_weighted <- thresholds[best_idx_w]
best_threshold_weighted
## [1] 0.837
min(wcosts)
## [1] 0.4414286
# Confusion matrix at weighted best threshold (training)
weighted_best <- get_weighted_cost(best_threshold_weighted, train$Class, train$prob, weight_FP, weight_FN)
weighted_best
## $wcost
## [1] 0.4414286
##
## $conf
## TP TN FP FN
## 281 190 20 209
Your observation: The model correctly identified most positive cases (TP = 281) but missed several (FN = 209). The weighted cost is 0.44, showing moderate misclassification impact.
train_pred_w <- factor(ifelse(train$prob >= best_threshold_weighted, "TRUE", "FALSE"), levels = c("FALSE","TRUE"))
conf_train_w <- confusionMatrix(train_pred_w, train$Class, positive = "TRUE")
conf_train_w$table
## Reference
## Prediction FALSE TRUE
## FALSE 190 209
## TRUE 20 281
MR_train_w <- 1 - conf_train_w$overall["Accuracy"]
MR_train_w
## Accuracy
## 0.3271429
Your observation: The model’s accuracy is 0.33, meaning it correctly predicts about one-third of the cases — showing limited performance.
test_pred_w <- factor(ifelse(test$prob >= best_threshold_weighted, "TRUE", "FALSE"), levels = c("FALSE","TRUE"))
conf_test_w <- confusionMatrix(test_pred_w, test$Class, positive = "TRUE")
conf_test_w$table
## Reference
## Prediction FALSE TRUE
## FALSE 69 92
## TRUE 21 118
MR_test_w <- 1 - conf_test_w$overall["Accuracy"]
MR_test_w
## Accuracy
## 0.3766667
Your observation: The model’s accuracy is 0.38, showing a small improvement but still relatively low prediction performance.
Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model further.
The optimal probability cut-off was around 0.41. For the training data, the model achieved a weighted misclassification rate (MR) of 0.44 and an AUC of 0.85, showing good discrimination ability. On the testing data, the AUC dropped to 0.76, and accuracy remained modest (around 0.38), suggesting possible over fitting.
For improving the model I would try feature selection or regularization to reduce over fitting, tune parameters for better balance between sensitivity and specificity, and possibly test alternative models like random forest or gradient boosting for higher accuracy.