Logistic Regression to classify customers based on the Credit Risk

German Credit Scoring Data

The German Credit Scoring data set contains classification of customers into good credit risk and bad risk based on several factors. The dataset contains 1000 observations of 21 variables. This dataset contains different factors which might contribute to a customer being classified as posing risk on a credit taken from the bank.There are 20 predictors into consideration to predict the class of the variable ‘response’.

set.seed(12383117)
library(verification)
library(dplyr)
library(tidyr)
library(MASS)
library(tidyverse)
library(ggcorrplot)
library(ggpubr)
german_credit <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")

colnames(german_credit) <- c("chk_acct", "duration", "credit_his", "purpose", 
                            "amount", "saving_acct", "present_emp", "installment_rate", "sex", "other_debtor", 
                            "present_resid", "property", "age", "other_install", "housing", "n_credits", 
                            "job", "n_people", "telephone", "foreign", "response")

# orginal response coding 1= good, 2 = bad we need 0 = good, 1 = bad
german_credit$response = german_credit$response - 1

Exploratory Data Analysis

We first check the summary statistics for the important variables.

summary(german_credit)

##  chk_acct     duration    credit_his    purpose        amount     
##  A11:274   Min.   : 4.0   A30: 40    A43    :280   Min.   :  250  
##  A12:269   1st Qu.:12.0   A31: 49    A40    :234   1st Qu.: 1366  
##  A13: 63   Median :18.0   A32:530    A42    :181   Median : 2320  
##  A14:394   Mean   :20.9   A33: 88    A41    :103   Mean   : 3271  
##            3rd Qu.:24.0   A34:293    A49    : 97   3rd Qu.: 3972  
##            Max.   :72.0              A46    : 50   Max.   :18424  
##                                      (Other): 55                  
##  saving_acct present_emp installment_rate  sex      other_debtor
##  A61:603     A71: 62     Min.   :1.000    A91: 50   A101:907    
##  A62:103     A72:172     1st Qu.:2.000    A92:310   A102: 41    
##  A63: 63     A73:339     Median :3.000    A93:548   A103: 52    
##  A64: 48     A74:174     Mean   :2.973    A94: 92               
##  A65:183     A75:253     3rd Qu.:4.000                          
##                          Max.   :4.000                          
##                                                                 
##  present_resid   property        age        other_install housing   
##  Min.   :1.000   A121:282   Min.   :19.00   A141:139      A151:179  
##  1st Qu.:2.000   A122:232   1st Qu.:27.00   A142: 47      A152:713  
##  Median :3.000   A123:332   Median :33.00   A143:814      A153:108  
##  Mean   :2.845   A124:154   Mean   :35.55                           
##  3rd Qu.:4.000              3rd Qu.:42.00                           
##  Max.   :4.000              Max.   :75.00                           
##                                                                     
##    n_credits       job         n_people     telephone  foreign   
##  Min.   :1.000   A171: 22   Min.   :1.000   A191:596   A201:963  
##  1st Qu.:1.000   A172:200   1st Qu.:1.000   A192:404   A202: 37  
##  Median :1.000   A173:630   Median :1.000                        
##  Mean   :1.407   A174:148   Mean   :1.155                        
##  3rd Qu.:2.000              3rd Qu.:1.000                        
##  Max.   :4.000              Max.   :2.000                        
##                                                                  
##     response  
##  Min.   :0.0  
##  1st Qu.:0.0  
##  Median :0.0  
##  Mean   :0.3  
##  3rd Qu.:1.0  
##  Max.   :1.0  
##

There are 13 categorical variables in the dataset.There are no missing values. The variable ‘response’ takes binary values - 0 for low risk(good) and 1 for high risk(bad).

We are also interested in checking the correlation between important variables and response.

factors.credit <- c(1,3,4,6,7,9,10,12,14,15,17,19,20)
german_credit[,factors.credit] <- lapply(german_credit[,factors.credit], factor)
num.credit <- c(2,5,8,11,13,16,18,21)
german_credit[num.credit] <- sapply(german_credit[num.credit], as.numeric)
cor.credit <- cor(german_credit[num.credit])
ggcorrplot(cor.credit,hc.order = TRUE,lab=TRUE) + ggtitle("Correlation between numeric variables") + theme(axis.text.x = element_text(angle = 90, vjust=0.5))

We can see that variables amount and duration have a positive correlation with response, while age impacts it negatively. Hence higher amount and duration imply the value of response will be be higher i.e. closer to 1. Hence, it increases the chance of high credit risk i.e. bad customer.

We now split our data set into two parts - one with customers who have been classified as good, and one with customers classified as bad. The data sets contain 700 and 300 observations respectively. The distribution of variables for each set is also checked.

#Boxplot for age
ggplot(german_credit,aes(x=as.factor(response),y=age))+ geom_boxplot(fill="blue") + labs(title='Age by risk classification',x='Risk Classification(1 for bad risk, 0 for good risk)',y='Age(in years)')

We can see that median age for people classified as bad is lower. Hence, younger people have more chance of being a bad customer and pose higher credit risk.

#Boxplot for amount
ggplot(german_credit,aes(x=as.factor(response),y=amount))+ geom_boxplot(fill="green") + labs(title='Credit Amount by risk classification',x='Risk Classification(1 for bad risk, 0 for good risk)',y='Credit Amount')

We can see that median credit amount for bad customers is slightly higher. This indicates that people who have higher credit amounts might be higher risk customers.

#Boxplot for duration
ggplot(german_credit,aes(x=as.factor(response),y=duration))+ geom_boxplot(fill="brown") + labs(title='Duration of credit by risk classification',x='Risk Classification(1 for bad risk, 0 for good risk)',y='Duration of credit')

We can see that customers classified as bad have higher median duration of credit.

Splitting dataset into Training and Test data

We first split our dataset into training and test data set. We randomly select 80% of the observations as a part of the training data set and 20% of the observations as the test data set. The model will be built on the training data set and will be tested on the test data set.

#Create testing and training data set
subset <- sample(nrow(german_credit), nrow(german_credit) * 0.8)
credit.train <- german_credit[subset, ]
credit.test <- german_credit[-subset, ]

Model creation using Logistic Regression

Since the response variable is qualitative i.e. we classify the customer as Good or Bad based on Credit Risk, we use the Logistic Regression technique for model creation.

Model selection using the AIC criterion

We build two models. First model is the null model containing only the response variable and no predictors. The second model is the full model containing all 20 predictors.

Now we select the best model using the stepwise selection method - with AIC criterion.

credit.glm0 <- glm(response ~ ., family = "binomial", credit.train)
null <- glm(response~1,family = "binomial",credit.train)
full <- glm(response~.,family = "binomial",credit.train)
lm.AIC <- stepAIC(full,scope=list(lower=null,upper=full),k=2,direction = "both")

summary(lm.AIC)

## 
## Call:
## glm(formula = response ~ chk_acct + duration + credit_his + purpose + 
##     amount + saving_acct + installment_rate + sex + other_debtor + 
##     other_install + housing + n_credits + telephone + foreign, 
##     family = "binomial", data = credit.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2517  -0.6770  -0.3673   0.6500   2.9525  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        4.444e-01  8.597e-01   0.517 0.605206    
## chk_acctA12       -5.669e-01  2.469e-01  -2.296 0.021682 *  
## chk_acctA13       -1.256e+00  4.167e-01  -3.015 0.002572 ** 
## chk_acctA14       -1.863e+00  2.574e-01  -7.238 4.57e-13 ***
## duration           2.462e-02  1.025e-02   2.402 0.016317 *  
## credit_hisA31      1.951e-01  6.141e-01   0.318 0.750761    
## credit_hisA32     -5.820e-01  4.787e-01  -1.216 0.224093    
## credit_hisA33     -6.401e-01  5.240e-01  -1.221 0.221934    
## credit_hisA34     -1.611e+00  4.844e-01  -3.326 0.000881 ***
## purposeA41        -1.786e+00  4.249e-01  -4.202 2.64e-05 ***
## purposeA410       -8.062e-01  8.125e-01  -0.992 0.321088    
## purposeA42        -6.379e-01  2.856e-01  -2.233 0.025517 *  
## purposeA43        -9.070e-01  2.807e-01  -3.231 0.001235 ** 
## purposeA44        -5.604e-01  8.712e-01  -0.643 0.520073    
## purposeA45        -3.518e-01  6.582e-01  -0.535 0.592956    
## purposeA46         2.195e-01  4.529e-01   0.485 0.627925    
## purposeA48        -2.322e+00  1.253e+00  -1.853 0.063853 .  
## purposeA49        -8.577e-01  3.791e-01  -2.263 0.023655 *  
## amount             1.645e-04  4.875e-05   3.375 0.000738 ***
## saving_acctA62    -2.122e-01  3.172e-01  -0.669 0.503423    
## saving_acctA63    -1.187e-01  4.246e-01  -0.280 0.779800    
## saving_acctA64    -1.459e+00  5.789e-01  -2.520 0.011740 *  
## saving_acctA65    -1.007e+00  2.939e-01  -3.426 0.000613 ***
## installment_rate   4.063e-01  9.829e-02   4.133 3.57e-05 ***
## sexA92            -1.406e-01  4.105e-01  -0.343 0.731931    
## sexA93            -8.778e-01  4.028e-01  -2.179 0.029326 *  
## sexA94            -3.046e-01  4.801e-01  -0.634 0.525783    
## other_debtorA102   6.940e-01  4.291e-01   1.617 0.105834    
## other_debtorA103  -1.628e+00  5.564e-01  -2.926 0.003429 ** 
## other_installA142 -3.622e-02  4.487e-01  -0.081 0.935662    
## other_installA143 -6.043e-01  2.676e-01  -2.258 0.023942 *  
## housingA152       -4.778e-01  2.560e-01  -1.866 0.062018 .  
## housingA153       -6.258e-01  3.811e-01  -1.642 0.100558    
## n_credits          3.105e-01  2.134e-01   1.455 0.145640    
## telephoneA192     -3.195e-01  2.136e-01  -1.496 0.134639    
## foreignA202       -1.066e+00  6.711e-01  -1.589 0.112125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 973.97  on 799  degrees of freedom
## Residual deviance: 704.81  on 764  degrees of freedom
## AIC: 776.81
## 
## Number of Fisher Scoring iterations: 5

The model with AIC criterion for selection has 14 predictors. It drops the following variables - present_emp, present_resid, property, age, job, and people.

In-Sample performance of the model

Using the model generated through logistic regression, we now plot the ROC curve (true positives vs. false positives) using the training data and find the Area Under the curve (AUC ). Higher value of AUC will indicate better model performance.

AIC.mod <- glm(formula = response ~ chk_acct + duration + credit_his + purpose + 
                 amount + saving_acct + present_emp + installment_rate + sex + 
                 other_debtor + property + age + other_install + n_credits + 
                 foreign, family = "binomial", data = credit.train)

#ROC curve
in.prob <- predict(AIC.mod,type="response")
roc.plot(credit.train$response == "1", in.prob)$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8417043 4.067723e-53          NA

For the training dataset, the AUC value comes out to be 0.8417.

For the misclassification rate, we find an optimal cut-off probability derived using the asymmetric cost function (weight for high credit risk is chosen as 5 compared to 1 for low credit risk). The cost function implies that the wrongly classifying customers that are supposed to be classified as Bad costs 5 times than wrongly classifying customers supposed to be classified as Good. We plot the cost function in training set against the cut-off probability.

##Cost function
# define the searc grid from 0.01 to 0.99
searchgrid = seq(0.01, 0.99, 0.01)
# result is a 99x2 matrix, the 1st col stores the cut-off p, the 2nd column
# stores the cost
result <- cbind(searchgrid, NA)
# in the cost function, both r and pi are vectors, r=truth, pi=predicted
# probability
cost1 <- function(r, pi) {
  weight1 = 5
  weight0 = 1
  c1 = (r == 1) & (pi < pcut)  #logical vector - true if actual 1 but predict 0
  c0 = (r == 0) & (pi > pcut)  #logical vecotr - true if actual 0 but predict 1
  return(mean(weight1 * c1 + weight0 * c0))
}
for (i in 1:length(searchgrid)) {
  pcut <- result[i, 1]
  # assign the cost to the 2nd col
  result[i, 2] <- cost1(credit.train$response, predict(AIC.mod, type = "response"))
}

plot(result, ylab = "Cost in Training Set")

pc <- result[which(result[,2] == min(result[,2])),]
pc[1]

## searchgrid 
##       0.16

The value of cut-off probability comes out to be 0.16.

We now calculate the misclassification rate using this value.

#In sample misclassification rate

credit.train$pred <-  ifelse(in.prob>pc[1],1,0)
table(pred = credit.train$pred , true = credit.train$response)

##     true
## pred   0   1
##    0 319  18
##    1 243 220

mean(credit.train$pred != credit.train$response)

## [1] 0.32625

The misclassification rate comes out to be 0.3265. This implies the model created using the training data is around 67.35% accurate in predicting the customer’s classification.

Out-of-Sample Performance of the Model

We now test the model created using the training data set on the test data set containing 20% of the observations. The test data set contains 800 observations of 21 variables. We will find the AUC and the asymmetric misclassification rate for this dataset as well.

We plot the ROC curve for the test data set.

model <- glm(formula = response ~ chk_acct + duration + credit_his + purpose + 
                 amount + saving_acct + present_emp + installment_rate + sex + 
                 other_debtor + property + age + other_install + n_credits + 
                 foreign, family = "binomial", data = credit.test)

out.prob <- predict(model,type="response")

#ROC curve
roc.plot(credit.test$response == "1", out.prob)$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8699158 3.158491e-17          NA

The value of AUC comes out to be 0.8699.

We now calculate the asymmetric classification rate for the test data set. It is to be noted that we will be using the value of cut-off probability as 0.16(the value we had derived using the cost function).

credit.test$pred <-  ifelse(out.prob>0.16,1,0)
table(pred = credit.test$pred , true = credit.test$response)

##     true
## pred  0  1
##    0 79  5
##    1 59 57

mean(credit.test$pred != credit.test$response)

## [1] 0.32

The value of asymmetric misclassification rate comes out to be 0.32. This means our model is around 68% accurate in predicting the credit risk classification for customers in the test data set.

Conclusion

The final model created for classifying the customers based on credit risk using logistics regression contains 14 predictors. The in-sample and out-of-sample performance of this model is comparable. The model can accurately predict the credit risk classification for more than 67% of the customers in the data set.