Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format

## Loading required package: ggplot2

## Loading required package: lattice

data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: The dataset has 1,000 rows and 62 variables, mostly numeric or binary. The target variable Class shows whether a customer has good credit (TRUE) or not (FALSE).

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.

2. Explore the dataset to understand its structure. (10pts)

colSums(is.na(GermanCredit))

##                           Duration                             Amount 
##                                  0                                  0 
##          InstallmentRatePercentage                  ResidenceDuration 
##                                  0                                  0 
##                                Age              NumberExistingCredits 
##                                  0                                  0 
##            NumberPeopleMaintenance                          Telephone 
##                                  0                                  0 
##                      ForeignWorker                              Class 
##                                  0                                  0 
##         CheckingAccountStatus.lt.0     CheckingAccountStatus.0.to.200 
##                                  0                                  0 
##       CheckingAccountStatus.gt.200     CreditHistory.NoCredit.AllPaid 
##                                  0                                  0 
##     CreditHistory.ThisBank.AllPaid             CreditHistory.PaidDuly 
##                                  0                                  0 
##                CreditHistory.Delay                     Purpose.NewCar 
##                                  0                                  0 
##                    Purpose.UsedCar        Purpose.Furniture.Equipment 
##                                  0                                  0 
##           Purpose.Radio.Television          Purpose.DomesticAppliance 
##                                  0                                  0 
##                    Purpose.Repairs                  Purpose.Education 
##                                  0                                  0 
##                 Purpose.Retraining                   Purpose.Business 
##                                  0                                  0 
##         SavingsAccountBonds.lt.100     SavingsAccountBonds.100.to.500 
##                                  0                                  0 
##    SavingsAccountBonds.500.to.1000        SavingsAccountBonds.gt.1000 
##                                  0                                  0 
##            EmploymentDuration.lt.1          EmploymentDuration.1.to.4 
##                                  0                                  0 
##          EmploymentDuration.4.to.7            EmploymentDuration.gt.7 
##                                  0                                  0 
##   Personal.Male.Divorced.Seperated          Personal.Female.NotSingle 
##                                  0                                  0 
##               Personal.Male.Single        OtherDebtorsGuarantors.None 
##                                  0                                  0 
## OtherDebtorsGuarantors.CoApplicant                Property.RealEstate 
##                                  0                                  0 
##                 Property.Insurance                  Property.CarOther 
##                                  0                                  0 
##         OtherInstallmentPlans.Bank       OtherInstallmentPlans.Stores 
##                                  0                                  0 
##                       Housing.Rent                        Housing.Own 
##                                  0                                  0 
##            Job.UnemployedUnskilled              Job.UnskilledResident 
##                                  0                                  0 
##                Job.SkilledEmployee 
##                                  0

sapply(GermanCredit, class)

##                           Duration                             Amount 
##                          "integer"                          "integer" 
##          InstallmentRatePercentage                  ResidenceDuration 
##                          "integer"                          "integer" 
##                                Age              NumberExistingCredits 
##                          "integer"                          "integer" 
##            NumberPeopleMaintenance                          Telephone 
##                          "integer"                          "numeric" 
##                      ForeignWorker                              Class 
##                          "numeric"                          "logical" 
##         CheckingAccountStatus.lt.0     CheckingAccountStatus.0.to.200 
##                          "numeric"                          "numeric" 
##       CheckingAccountStatus.gt.200     CreditHistory.NoCredit.AllPaid 
##                          "numeric"                          "numeric" 
##     CreditHistory.ThisBank.AllPaid             CreditHistory.PaidDuly 
##                          "numeric"                          "numeric" 
##                CreditHistory.Delay                     Purpose.NewCar 
##                          "numeric"                          "numeric" 
##                    Purpose.UsedCar        Purpose.Furniture.Equipment 
##                          "numeric"                          "numeric" 
##           Purpose.Radio.Television          Purpose.DomesticAppliance 
##                          "numeric"                          "numeric" 
##                    Purpose.Repairs                  Purpose.Education 
##                          "numeric"                          "numeric" 
##                 Purpose.Retraining                   Purpose.Business 
##                          "numeric"                          "numeric" 
##         SavingsAccountBonds.lt.100     SavingsAccountBonds.100.to.500 
##                          "numeric"                          "numeric" 
##    SavingsAccountBonds.500.to.1000        SavingsAccountBonds.gt.1000 
##                          "numeric"                          "numeric" 
##            EmploymentDuration.lt.1          EmploymentDuration.1.to.4 
##                          "numeric"                          "numeric" 
##          EmploymentDuration.4.to.7            EmploymentDuration.gt.7 
##                          "numeric"                          "numeric" 
##   Personal.Male.Divorced.Seperated          Personal.Female.NotSingle 
##                          "numeric"                          "numeric" 
##               Personal.Male.Single        OtherDebtorsGuarantors.None 
##                          "numeric"                          "numeric" 
## OtherDebtorsGuarantors.CoApplicant                Property.RealEstate 
##                          "numeric"                          "numeric" 
##                 Property.Insurance                  Property.CarOther 
##                          "numeric"                          "numeric" 
##         OtherInstallmentPlans.Bank       OtherInstallmentPlans.Stores 
##                          "numeric"                          "numeric" 
##                       Housing.Rent                        Housing.Own 
##                          "numeric"                          "numeric" 
##            Job.UnemployedUnskilled              Job.UnskilledResident 
##                          "numeric"                          "numeric" 
##                Job.SkilledEmployee 
##                          "numeric"

Your observation: There are no missing values in the dataset. Most variables are numeric, while Class is logical, meaning the data is clean and ready for modeling.

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

set.seed(2024)
trainIndex <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
train <- GermanCredit[trainIndex, ]
test  <- GermanCredit[-trainIndex, ]

#quick check
prop.table(table(train$Class))

## 
## FALSE  TRUE 
##   0.3   0.7

prop.table(table(test$Class))

## 
## FALSE  TRUE 
##   0.3   0.7

Your observation: The data was split into 70% training and 30% testing, keeping the same class balance of about 30% bad and 70% good credit cases.

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

train$Class <- factor(train$Class, levels = c("FALSE","TRUE"))  
test$Class  <- factor(test$Class,  levels = c("FALSE","TRUE"))
lrm_all <- glm(Class ~ ., data = train, family = binomial(link = "logit"))
summary(lrm_all)

## 
## Call:
## glm(formula = Class ~ ., family = binomial(link = "logit"), data = train)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         9.7755921  1.7594925   5.556 2.76e-08 ***
## Duration                           -0.0281752  0.0114559  -2.459 0.013915 *  
## Amount                             -0.0001968  0.0000580  -3.394 0.000690 ***
## InstallmentRatePercentage          -0.3458012  0.1122102  -3.082 0.002058 ** 
## ResidenceDuration                  -0.1477247  0.1099835  -1.343 0.179222    
## Age                                -0.0011930  0.0111092  -0.107 0.914479    
## NumberExistingCredits              -0.1741853  0.2247245  -0.775 0.438277    
## NumberPeopleMaintenance            -0.2953842  0.3033517  -0.974 0.330188    
## Telephone                          -0.8357009  0.2619015  -3.191 0.001418 ** 
## ForeignWorker                      -1.6606566  0.8122576  -2.044 0.040905 *  
## CheckingAccountStatus.lt.0         -2.0280291  0.2899845  -6.994 2.68e-12 ***
## CheckingAccountStatus.0.to.200     -1.4706478  0.2943908  -4.996 5.87e-07 ***
## CheckingAccountStatus.gt.200       -0.6052653  0.4876931  -1.241 0.214577    
## CreditHistory.NoCredit.AllPaid     -1.2639798  0.5155113  -2.452 0.014211 *  
## CreditHistory.ThisBank.AllPaid     -1.8780235  0.5706646  -3.291 0.000999 ***
## CreditHistory.PaidDuly             -0.8775997  0.3159046  -2.778 0.005469 ** 
## CreditHistory.Delay                -0.4012640  0.4307837  -0.931 0.351608    
## Purpose.NewCar                     -1.0626620  0.8142904  -1.305 0.191887    
## Purpose.UsedCar                     1.1942539  0.8839916   1.351 0.176702    
## Purpose.Furniture.Equipment        -0.1681192  0.8320966  -0.202 0.839883    
## Purpose.Radio.Television           -0.3031554  0.8286036  -0.366 0.714467    
## Purpose.DomesticAppliance          -0.7371787  1.2321421  -0.598 0.549646    
## Purpose.Repairs                    -0.8575710  0.9887784  -0.867 0.385776    
## Purpose.Education                  -0.6848705  0.9364025  -0.731 0.464544    
## Purpose.Retraining                 -0.1649183  1.5465838  -0.107 0.915079    
## Purpose.Business                   -0.3600823  0.8535288  -0.422 0.673116    
## SavingsAccountBonds.lt.100         -0.9786195  0.3127225  -3.129 0.001752 ** 
## SavingsAccountBonds.100.to.500     -0.9669534  0.4406228  -2.195 0.028198 *  
## SavingsAccountBonds.500.to.1000    -0.2529878  0.5442721  -0.465 0.642061    
## SavingsAccountBonds.gt.1000         0.2713176  0.6594268   0.411 0.680747    
## EmploymentDuration.lt.1            -0.4435735  0.5345880  -0.830 0.406681    
## EmploymentDuration.1.to.4          -0.4275141  0.5069023  -0.843 0.399013    
## EmploymentDuration.4.to.7           0.4416798  0.5618787   0.786 0.431822    
## EmploymentDuration.gt.7            -0.2520532  0.5037635  -0.500 0.616835    
## Personal.Male.Divorced.Seperated   -0.4301280  0.5538492  -0.777 0.437385    
## Personal.Female.NotSingle          -0.0179029  0.3950224  -0.045 0.963851    
## Personal.Male.Single                0.6299901  0.3971902   1.586 0.112713    
## OtherDebtorsGuarantors.None        -1.0309812  0.5142560  -2.005 0.044984 *  
## OtherDebtorsGuarantors.CoApplicant -1.0727811  0.7201303  -1.490 0.136302    
## Property.RealEstate                 1.2295999  0.5185315   2.371 0.017725 *  
## Property.Insurance                  0.8935212  0.5097800   1.753 0.079643 .  
## Property.CarOther                   1.1356001  0.5048681   2.249 0.024493 *  
## OtherInstallmentPlans.Bank         -0.6436463  0.3046547  -2.113 0.034626 *  
## OtherInstallmentPlans.Stores       -0.2405278  0.4731218  -0.508 0.611184    
## Housing.Rent                       -0.7041915  0.5817432  -1.210 0.226093    
## Housing.Own                        -0.5109041  0.5552490  -0.920 0.357502    
## Job.UnemployedUnskilled             0.4681174  0.8091298   0.579 0.562897    
## Job.UnskilledResident               0.3450109  0.4498926   0.767 0.443156    
## Job.SkilledEmployee                 0.1604813  0.3719210   0.431 0.666110    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 855.21  on 699  degrees of freedom
## Residual deviance: 595.77  on 651  degrees of freedom
## AIC: 693.77
## 
## Number of Fisher Scoring iterations: 5

Your observation: The logistic regression model fits well with several significant predictors. Variables like Amount, Duration, CheckingAccountStatus, and Telephone have strong effects on credit risk, while some variables were excluded due to multicollinearity.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

coefs <- summary(lrm_all)$coefficients
odds_ratios <- exp(coef(lrm_all))
cbind(coefs, OddsRatio = odds_ratios)

##                                         Estimate   Std. Error     z value
## (Intercept)                         9.7755920697 1.759492e+00  5.55591574
## Duration                           -0.0281751864 1.145588e-02 -2.45945116
## Amount                             -0.0001968268 5.799757e-05 -3.39370661
## InstallmentRatePercentage          -0.3458012342 1.122102e-01 -3.08172646
## ResidenceDuration                  -0.1477246581 1.099835e-01 -1.34315321
## Age                                -0.0011930223 1.110923e-02 -0.10739018
## NumberExistingCredits              -0.1741853072 2.247245e-01 -0.77510593
## NumberPeopleMaintenance            -0.2953842316 3.033517e-01 -0.97373507
## Telephone                          -0.8357008957 2.619015e-01 -3.19089796
## ForeignWorker                      -1.6606565630 8.122576e-01 -2.04449491
## CheckingAccountStatus.lt.0         -2.0280291351 2.899845e-01 -6.99357666
## CheckingAccountStatus.0.to.200     -1.4706478442 2.943908e-01 -4.99556266
## CheckingAccountStatus.gt.200       -0.6052653383 4.876931e-01 -1.24107823
## CreditHistory.NoCredit.AllPaid     -1.2639798373 5.155113e-01 -2.45189546
## CreditHistory.ThisBank.AllPaid     -1.8780234743 5.706646e-01 -3.29094092
## CreditHistory.PaidDuly             -0.8775997275 3.159046e-01 -2.77805295
## CreditHistory.Delay                -0.4012639884 4.307837e-01 -0.93147437
## Purpose.NewCar                     -1.0626619618 8.142904e-01 -1.30501598
## Purpose.UsedCar                     1.1942539313 8.839916e-01  1.35097883
## Purpose.Furniture.Equipment        -0.1681192131 8.320966e-01 -0.20204290
## Purpose.Radio.Television           -0.3031553848 8.286036e-01 -0.36586297
## Purpose.DomesticAppliance          -0.7371786695 1.232142e+00 -0.59829029
## Purpose.Repairs                    -0.8575710181 9.887784e-01 -0.86730357
## Purpose.Education                  -0.6848705104 9.364025e-01 -0.73138471
## Purpose.Retraining                 -0.1649182660 1.546584e+00 -0.10663390
## Purpose.Business                   -0.3600823059 8.535288e-01 -0.42187482
## SavingsAccountBonds.lt.100         -0.9786194558 3.127225e-01 -3.12935373
## SavingsAccountBonds.100.to.500     -0.9669533652 4.406228e-01 -2.19451492
## SavingsAccountBonds.500.to.1000    -0.2529877648 5.442721e-01 -0.46481853
## SavingsAccountBonds.gt.1000         0.2713175773 6.594268e-01  0.41144460
## EmploymentDuration.lt.1            -0.4435735258 5.345880e-01 -0.82974836
## EmploymentDuration.1.to.4          -0.4275141090 5.069024e-01 -0.84338553
## EmploymentDuration.4.to.7           0.4416798369 5.618787e-01  0.78607684
## EmploymentDuration.gt.7            -0.2520532431 5.037635e-01 -0.50034043
## Personal.Male.Divorced.Seperated   -0.4301280237 5.538492e-01 -0.77661578
## Personal.Female.NotSingle          -0.0179028535 3.950224e-01 -0.04532111
## Personal.Male.Single                0.6299900791 3.971902e-01  1.58611688
## OtherDebtorsGuarantors.None        -1.0309812406 5.142559e-01 -2.00480178
## OtherDebtorsGuarantors.CoApplicant -1.0727811145 7.201303e-01 -1.48970425
## Property.RealEstate                 1.2295998649 5.185315e-01  2.37131168
## Property.Insurance                  0.8935212291 5.097800e-01  1.75275860
## Property.CarOther                   1.1356000792 5.048681e-01  2.24930068
## OtherInstallmentPlans.Bank         -0.6436463267 3.046547e-01 -2.11270803
## OtherInstallmentPlans.Stores       -0.2405277566 4.731218e-01 -0.50838439
## Housing.Rent                       -0.7041915492 5.817432e-01 -1.21048523
## Housing.Own                        -0.5109041125 5.552490e-01 -0.92013510
## Job.UnemployedUnskilled             0.4681173901 8.091298e-01  0.57854424
## Job.UnskilledResident               0.3450109181 4.498926e-01  0.76687392
## Job.SkilledEmployee                 0.1604813103 3.719210e-01  0.43149304
##                                        Pr(>|z|)    OddsRatio
## (Intercept)                        2.761601e-08 1.759891e+04
## Duration                           1.391496e-02 9.722180e-01
## Amount                             6.895352e-04 9.998032e-01
## InstallmentRatePercentage          2.058039e-03 7.076531e-01
## ResidenceDuration                  1.792224e-01 8.626686e-01
## Age                                9.144794e-01 9.988077e-01
## NumberExistingCredits              4.382771e-01 8.401412e-01
## NumberPeopleMaintenance            3.301881e-01 7.442456e-01
## Telephone                          1.418314e-03 4.335705e-01
## ForeignWorker                      4.090468e-02 1.900142e-01
## CheckingAccountStatus.lt.0         2.679653e-12 1.315946e-01
## CheckingAccountStatus.0.to.200     5.866447e-07 2.297766e-01
## CheckingAccountStatus.gt.200       2.145769e-01 5.459296e-01
## CreditHistory.NoCredit.AllPaid     1.421059e-02 2.825274e-01
## CreditHistory.ThisBank.AllPaid     9.985288e-04 1.528920e-01
## CreditHistory.PaidDuly             5.468571e-03 4.157797e-01
## CreditHistory.Delay                3.516082e-01 6.694733e-01
## Purpose.NewCar                     1.918874e-01 3.455348e-01
## Purpose.UsedCar                    1.767022e-01 3.301094e+00
## Purpose.Furniture.Equipment        8.398832e-01 8.452531e-01
## Purpose.Radio.Television           7.144673e-01 7.384843e-01
## Purpose.DomesticAppliance          5.496463e-01 4.784619e-01
## Purpose.Repairs                    3.857757e-01 4.241912e-01
## Purpose.Education                  4.645442e-01 5.041555e-01
## Purpose.Retraining                 9.150794e-01 8.479630e-01
## Purpose.Business                   6.731164e-01 6.976189e-01
## SavingsAccountBonds.lt.100         1.751913e-03 3.758296e-01
## SavingsAccountBonds.100.to.500     2.819841e-02 3.802397e-01
## SavingsAccountBonds.500.to.1000    6.420614e-01 7.764774e-01
## SavingsAccountBonds.gt.1000        6.807466e-01 1.311692e+00
## EmploymentDuration.lt.1            4.066811e-01 6.417390e-01
## EmploymentDuration.1.to.4          3.990129e-01 6.521282e-01
## EmploymentDuration.4.to.7          4.318225e-01 1.555318e+00
## EmploymentDuration.gt.7            6.168354e-01 7.772034e-01
## Personal.Male.Divorced.Seperated   4.373855e-01 6.504258e-01
## Personal.Female.NotSingle          9.638514e-01 9.822565e-01
## Personal.Male.Single               1.127128e-01 1.877592e+00
## OtherDebtorsGuarantors.None        4.498424e-02 3.566568e-01
## OtherDebtorsGuarantors.CoApplicant 1.363020e-01 3.420559e-01
## Property.RealEstate                1.772508e-02 3.419861e+00
## Property.Insurance                 7.964345e-02 2.443719e+00
## Property.CarOther                  2.449337e-02 3.113041e+00
## OtherInstallmentPlans.Bank         3.462577e-02 5.253732e-01
## OtherInstallmentPlans.Stores       6.111838e-01 7.862128e-01
## Housing.Rent                       2.260928e-01 4.945082e-01
## Housing.Own                        3.575022e-01 5.999529e-01
## Job.UnemployedUnskilled            5.628967e-01 1.596985e+00
## Job.UnskilledResident              4.431565e-01 1.412005e+00
## Job.SkilledEmployee                6.661099e-01 1.174076e+00

Your observation: Important predictors include CheckingAccountStatus, Amount, and CreditHistory, which strongly affect credit risk. For example, lower account status greatly reduces the odds of being a good credit risk.

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

train$prob <- predict(lrm_all, newdata = train, type = "response")

Your observation:

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

get_MR <- function(threshold, obs, prob) {
  pred <- ifelse(prob >= threshold, "TRUE", "FALSE")
  cm <- table(Predicted = pred, Actual = obs)
  
#misclassification rate
  MR <- 1 - sum(diag(cm)) / sum(cm)
  return(list(MR = MR, cm = cm))
}

thresholds <- seq(0, 1, by = 0.001)
mr_vals <- sapply(thresholds, function(t) get_MR(t, train$Class, train$prob)$MR)
best_idx <- which.min(mr_vals)
best_threshold_equal <- thresholds[best_idx]
best_threshold_equal

## [1] 0.411

min(mr_vals)

## [1] 0.19

# Show confusion matrix at best threshold
best_result_equal <- get_MR(best_threshold_equal, train$Class, train$prob)
best_result_equal$cm

##          Actual
## Predicted FALSE TRUE
##     FALSE   104   27
##     TRUE    106  463

Your observation: The optimal probability cutoff is around 0.41, giving a misclassification rate of about 19%. The model correctly classifies most good credit cases but misses some bad ones.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

train_pred_equal <- factor(ifelse(train$prob >= best_threshold_equal, "TRUE", "FALSE"), levels = c("FALSE","TRUE"))
conf_train_equal <- confusionMatrix(train_pred_equal, train$Class, positive = "TRUE")
conf_train_equal$table

##           Reference
## Prediction FALSE TRUE
##      FALSE   104   27
##      TRUE    106  463

conf_train_equal$overall["Accuracy"]

## Accuracy 
##     0.81

# MR = 1 - Accuracy
MR_train_equal <- 1 - conf_train_equal$overall["Accuracy"]
MR_train_equal

## Accuracy 
##     0.19

Your observation: The model achieved about 81% accuracy and a 19% misclassification rate on the training set, showing a fairly good fit.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

if (!require(pROC, quietly = TRUE)) {
    install.packages("pROC")
}

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(pROC)

roc_train <- roc(response = as.numeric(train$Class == "TRUE"), predictor = train$prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc_train <- auc(roc_train)
auc_train

## Area under the curve: 0.8497

Your observation: The model’s AUC is 0.85, showing good ability to distinguish between good and bad credit applicants.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

test$prob <- predict(lrm_all, newdata = test, type = "response")
test_pred_equal <- factor(ifelse(test$prob >= best_threshold_equal, "TRUE", "FALSE"), levels = c("FALSE","TRUE"))
conf_test_equal <- confusionMatrix(test_pred_equal, test$Class, positive = "TRUE")
conf_test_equal$table

##           Reference
## Prediction FALSE TRUE
##      FALSE    30   22
##      TRUE     60  188

MR_test_equal <- 1 - conf_test_equal$overall["Accuracy"]
MR_test_equal

##  Accuracy 
## 0.2733333

Your observation: The model’s accuracy dropped to 0.27, meaning it’s now performing poorly at correctly predicting outcomes.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

roc_test <- roc(response = as.numeric(test$Class == "TRUE"), predictor = test$prob)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc_test <- auc(roc_test)
auc_test

## Area under the curve: 0.7562

Your observation: The model’s AUC is 0.76, showing a moderate ability to distinguish between good and bad credit cases.

Task 5: Using different weights (20pts)

Now, let’s assume “It is worse to class a customer as good when they are bad (weight = 5), than it is to class a customer as bad when they are good (weight = 1).” Please figure out which weight should be 5 and which weight should be 1. Then define your cost function accordingly!

1. Obtain optimal probability cut-off point again, with the new weights.

weight_FP <- 5  # cost of predicted = TRUE when actual = FALSE
weight_FN <- 1  # cost of predicted = FALSE when actual = TRUE

get_weighted_cost <- function(threshold, obs, prob, wFP, wFN) {
  pred <- ifelse(prob >= threshold, "TRUE", "FALSE")
  # counts:
  TP <- sum(pred == "TRUE"  & obs == "TRUE")
  TN <- sum(pred == "FALSE" & obs == "FALSE")
  FP <- sum(pred == "TRUE"  & obs == "FALSE")
  FN <- sum(pred == "FALSE" & obs == "TRUE")
  # weighted cost (normalized by n to be comparable to MR)
  wcost <- (wFP * FP + wFN * FN) / (TP + TN + FP + FN)
  return(list(wcost = wcost, conf = c(TP = TP, TN = TN, FP = FP, FN = FN)))
}

wcosts <- sapply(thresholds, function(t) get_weighted_cost(t, train$Class, train$prob, weight_FP, weight_FN)$wcost)
best_idx_w <- which.min(wcosts)
best_threshold_weighted <- thresholds[best_idx_w]
best_threshold_weighted

## [1] 0.837

min(wcosts)

## [1] 0.4414286

# Confusion matrix at weighted best threshold (training)
weighted_best <- get_weighted_cost(best_threshold_weighted, train$Class, train$prob, weight_FP, weight_FN)
weighted_best

## $wcost
## [1] 0.4414286
## 
## $conf
##  TP  TN  FP  FN 
## 281 190  20 209

Your observation: The model correctly identified most positive cases (TP = 281) but missed several (FN = 209). The weighted cost is 0.44, showing moderate misclassification impact.

2. Obtain the confusion matrix and MR for the training set.

train_pred_w <- factor(ifelse(train$prob >= best_threshold_weighted, "TRUE", "FALSE"), levels = c("FALSE","TRUE"))
conf_train_w <- confusionMatrix(train_pred_w, train$Class, positive = "TRUE")
conf_train_w$table

##           Reference
## Prediction FALSE TRUE
##      FALSE   190  209
##      TRUE     20  281

MR_train_w <- 1 - conf_train_w$overall["Accuracy"]
MR_train_w

##  Accuracy 
## 0.3271429

Your observation: The model’s accuracy is 0.33, meaning it correctly predicts about one-third of the cases — showing limited performance.

3. Obtain the confusion matrix and MR for the test set.

test_pred_w <- factor(ifelse(test$prob >= best_threshold_weighted, "TRUE", "FALSE"), levels = c("FALSE","TRUE"))
conf_test_w <- confusionMatrix(test_pred_w, test$Class, positive = "TRUE")
conf_test_w$table

##           Reference
## Prediction FALSE TRUE
##      FALSE    69   92
##      TRUE     21  118

MR_test_w <- 1 - conf_test_w$overall["Accuracy"]
MR_test_w

##  Accuracy 
## 0.3766667

Your observation: The model’s accuracy is 0.38, showing a small improvement but still relatively low prediction performance.

Task 6: Conlusion (10pts)

Summarize your findings, including the optimal probability cut-off, MR and AUC for both training and testing data. Discuss what you observed and what you will do to improve the model further.

The optimal probability cut-off was around 0.41. For the training data, the model achieved a weighted misclassification rate (MR) of 0.44 and an AUC of 0.85, showing good discrimination ability. On the testing data, the AUC dropped to 0.76, and accuracy remained modest (around 0.38), suggesting possible over fitting.

For improving the model I would try feature selection or regularization to reduce over fitting, tune parameters for better balance between sensitivity and specificity, and possibly test alternative models like random forest or gradient boosting for higher accuracy.

Homework5

tianhai

10/23/2025

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure. (10pts)

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

Task 5: Using different weights (20pts)

1. Obtain optimal probability cut-off point again, with the new weights.

2. Obtain the confusion matrix and MR for the training set.

3. Obtain the confusion matrix and MR for the test set.

Task 6: Conlusion (10pts)

Homework5

tianhai

10/23/2025

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure. (10pts)

3. Split the dataset into training and test set. Please use the random seed as 2024 for reproducibility. (10pts)

Task 2: Model Fitting (20pts)

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right.

2. Summarize the model and interpret the coefficients (pick at least one coefficient you think important and discuss it in detail).

Task 3: Find Optimal Probability Cut-off, with weight_FN = 1 and weight_FP = 1. (20pts)

1. Use the training set to obtain predicted probabilities.

2. Find the optimal probability cut-off point using the MR (misclassification rate) or equivalently the equal-weight cost.

Task 4: Model Evaluation (20pts)

1. Using the optimal probability cut-off point obtained in 3.2, generate confusion matrix and obtain MR for the the training set.

2. Using the optimal probability cut-off point obtained in 3.2, generate the ROC curve and calculate the AUC for the training set.

3. Using the same cut-off point, generate confusion matrix and obtain MR for the test set.

4. Using the same cut-off point, generate the ROC curve and calculate the AUC for the test set.

Task 5: Using different weights (20pts)

1. Obtain optimal probability cut-off point again, with the new weights.

2. Obtain the confusion matrix and MR for the training set.

3. Obtain the confusion matrix and MR for the test set.

Task 6: Conlusion (10pts)

3. Split the dataset into training and test set. Please use the random seed as `2024` for reproducibility. (10pts)