German Credit Risk Analysis: Part-1

The German credit scoring data is a dataset provided by Prof. Hogmann in the file german.data. The data set has information about 1000 individuals, on the basis of which they have been classified as risky or not. The variable response in the dataset corresponds to the risk label, 1 has been classified as bad and 2 has been classified as good.

In Part 1, we have performed Exploratory Data Analysis on the dataset and furthered it into building a Logistic Regression Model along with variable selection for the model building process.
In Part 2, we use CART, Bagging and Random Forest for model creation.

Libraries Used

We have used the following libraries for our analysis:

library(knitr)
library(dplyr)
library(tidyr)
library(reshape2)
library(RColorBrewer)
library(GGally)
library(ggplot2)
library(caret)
library(glmnet)
library(boot)
library(verification)

Initial EDA

Data Importing

We get the data from the link. We need to provide names for the columns and change the response labels to 1 and 0:
0 corresponding to a good credit record and 1 corresponding to a bad one (positive class).

german_credit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")

colnames(german_credit) = c("chk_acct", "duration", "credit_his", "purpose", 
                            "amount", "saving_acct", "present_emp", "installment_rate", "sex", "other_debtor", 
                            "present_resid", "property", "age", "other_install", "housing", "n_credits", 
                            "job", "n_people", "telephone", "foreign", "response")

german_credit$response = german_credit$response - 1
german_credit$response <- as.factor(german_credit$response)

Data Structure

There is a total on 21 attributes in the dataset. Their descriptions and details have been tabulated below:

  • Status of existing checking account.
  • Duration in month
  • Credit history
  • Purpose
  • Credit amount
  • Savings account/bonds
  • Present employment since
  • Installment rate in percentage of disposable income
  • Personal status and sex
  • Other debtors / guarantors
  • Present residence since
  • Property
  • Age in years
  • Other installment plans
  • Housing
  • Number of existing credits at this bank
  • Job
  • Number of people being liable to provide maintenance for
  • Telephone
  • foreign worker
glimpse(german_credit)
## Observations: 1,000
## Variables: 21
## $ chk_acct         <fct> A11, A12, A14, A11, A11, A14, A14, A12, A14, ...
## $ duration         <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48...
## $ credit_his       <fct> A34, A32, A34, A32, A33, A32, A32, A32, A32, ...
## $ purpose          <fct> A43, A43, A46, A42, A40, A46, A42, A41, A43, ...
## $ amount           <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 694...
## $ saving_acct      <fct> A65, A61, A61, A61, A61, A65, A63, A61, A64, ...
## $ present_emp      <fct> A75, A73, A74, A74, A73, A73, A75, A73, A74, ...
## $ installment_rate <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, ...
## $ sex              <fct> A93, A92, A93, A93, A93, A93, A93, A93, A91, ...
## $ other_debtor     <fct> A101, A101, A101, A103, A101, A101, A101, A10...
## $ present_resid    <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, ...
## $ property         <fct> A121, A121, A121, A122, A124, A124, A122, A12...
## $ age              <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 2...
## $ other_install    <fct> A143, A143, A143, A143, A143, A143, A143, A14...
## $ housing          <fct> A152, A152, A152, A153, A153, A153, A152, A15...
## $ n_credits        <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, ...
## $ job              <fct> A173, A173, A172, A173, A173, A172, A173, A17...
## $ n_people         <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ telephone        <fct> A192, A191, A191, A191, A191, A192, A191, A19...
## $ foreign          <fct> A201, A201, A201, A201, A201, A201, A201, A20...
## $ response         <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, ...

Summary Stats

We take the summary statistics of the dataset, the dataset has a total of 1000 observations with 21 variables, out of which 8 are numerical variables including the response and 13 are categorical variables with various levels. The summary statistics for the variables have been presented

summary(german_credit)
##  chk_acct     duration    credit_his    purpose        amount     
##  A11:274   Min.   : 4.0   A30: 40    A43    :280   Min.   :  250  
##  A12:269   1st Qu.:12.0   A31: 49    A40    :234   1st Qu.: 1366  
##  A13: 63   Median :18.0   A32:530    A42    :181   Median : 2320  
##  A14:394   Mean   :20.9   A33: 88    A41    :103   Mean   : 3271  
##            3rd Qu.:24.0   A34:293    A49    : 97   3rd Qu.: 3972  
##            Max.   :72.0              A46    : 50   Max.   :18424  
##                                      (Other): 55                  
##  saving_acct present_emp installment_rate  sex      other_debtor
##  A61:603     A71: 62     Min.   :1.000    A91: 50   A101:907    
##  A62:103     A72:172     1st Qu.:2.000    A92:310   A102: 41    
##  A63: 63     A73:339     Median :3.000    A93:548   A103: 52    
##  A64: 48     A74:174     Mean   :2.973    A94: 92               
##  A65:183     A75:253     3rd Qu.:4.000                          
##                          Max.   :4.000                          
##                                                                 
##  present_resid   property        age        other_install housing   
##  Min.   :1.000   A121:282   Min.   :19.00   A141:139      A151:179  
##  1st Qu.:2.000   A122:232   1st Qu.:27.00   A142: 47      A152:713  
##  Median :3.000   A123:332   Median :33.00   A143:814      A153:108  
##  Mean   :2.845   A124:154   Mean   :35.55                           
##  3rd Qu.:4.000              3rd Qu.:42.00                           
##  Max.   :4.000              Max.   :75.00                           
##                                                                     
##    n_credits       job         n_people     telephone  foreign    response
##  Min.   :1.000   A171: 22   Min.   :1.000   A191:596   A201:963   0:700   
##  1st Qu.:1.000   A172:200   1st Qu.:1.000   A192:404   A202: 37   1:300   
##  Median :1.000   A173:630   Median :1.000                                 
##  Mean   :1.407   A174:148   Mean   :1.155                                 
##  3rd Qu.:2.000              3rd Qu.:1.000                                 
##  Max.   :4.000              Max.   :2.000                                 
## 

Exploratory Data Analysis of Continuous Data

We get the following insights from our EDA of continuous variables:

  • From the age variable, we see that the median value for bad records is lesser than that of good records, it might be premature to say young people tend to have bad credit records, but we can safely assume it tends to be riskier.
  • The installment_rate variable has a great deal of difference between the good and bad records, we see that bad records have almost the double median value than good ones.
  • The median value and the range of the duration variables appears to be on the higher side of bad records as compared to good records
  • For the amount variable, we observe that the amount for bad records is larger in general as compared to good ones
  • We further built on this by plotting the density curve along the vertical line for their mean value and find that there is a great deal of difference for the duration as well as amount variable.

Duration

amount.mean = german_credit %>% dplyr::select(amount, response) %>% group_by(response) %>% summarise(m =mean(amount))
duration.mean = german_credit %>% dplyr::select(duration, response) %>%group_by(response) %>% summarise( m =mean(duration))

ggplot(german_credit, aes(duration, fill=response)) + 
  geom_density(alpha=.5) + geom_vline(data=german_credit, aes(xintercept=duration.mean[,2],  colour=response),
                                      linetype="dashed", size=1)

test.m = german_credit[,c(2,5,8,13,16,18,21)]
test.m$response <- as.numeric(test.m$response)
ggplot(melt(german_credit[,c(2,21)]), aes(x = variable, y = value, fill = response)) + geom_boxplot() + xlab("response") + ylab("duration")

Installment Rate

ggplot(german_credit, aes(factor(installment_rate), ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") + xlab("Installment Rates")

Amount

ggplot(german_credit, aes(amount, fill=response)) + 
  geom_density(alpha=.5) + geom_vline(data=german_credit, aes(xintercept=amount.mean[,2],  colour=response),
                                      linetype="dashed", size=1)

ggplot(melt(german_credit[,c(5,21)]), aes(x = variable, y = value, fill = response)) + 
geom_boxplot() + xlab("response") + ylab("amount")

Age

ggplot(melt(german_credit[,c(13,21)]), aes(x = variable, y = value, fill = response)) + 
geom_boxplot()+ xlab("response") + ylab("age")

n_credits

ggplot(melt(german_credit[,c(16,21)]), aes(x = variable, y = value, fill = response)) + 
geom_boxplot()

Exploratory Data Analysis of Categorical Data

We get the following insights from our EDA of categorical variables:

  • For chk_acct we see that, the current status of the checking account matters as the frequency of the response variables is seen to differ from one sub category to another, overall A11 houses more number of bad credit records and A14 the least
  • For credit_his, we observe that proportion of the response variable varies significantly, for categories A30, A31 we see the number of bad credit records are greater.
  • For the purpose variable, we observe that the proportion of good and bad credit record varies also overall A44, A45, A410 and A46 seem to include more risky records.
  • We also observe these trends in other variables like sex, other_debtor, saving_acct, other_install and foreign. Overall, the trend looks significant in saving_acct, purpose, credit_his and chk_acct as compared to others.

chk_acct

ggplot(german_credit, aes(chk_acct, ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") 

credit_his

ggplot(german_credit, aes(credit_his, ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") 

purpose

ggplot(german_credit, aes(purpose, ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") 

saving_acct

ggplot(german_credit, aes(saving_acct, ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") 

other_debtor

ggplot(german_credit, aes(other_debtor, ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") 

sex

ggplot(german_credit, aes(sex, ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") 

other_install

ggplot(german_credit, aes(other_install, ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") 

foreign

ggplot(german_credit, aes(foreign, ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") 

Logistic Regression to Predict Riskiness

We build a logistic regression model to predict riskiness

Train/Test split

Splitting our data into 80:20 Train test split using stratifiesd sampling to get equal amounts of data from each class

set.seed(12420424)
in.train <- createDataPartition(as.factor(german_credit$response), p=0.8, list=FALSE)
german_credit.train <- german_credit[in.train,]
german_credit.test <- german_credit[-in.train,]

Stepwise variable selection using AIC

From stepwise variable selection method using AIC, the significant variables are:

  • chk_acct
  • duration
  • credit_his
  • purpose
  • amount
  • saving_acct
  • installment_rate
  • sex
  • other_debtor
  • age
  • other_install
  • telephone
  • foreign
credit.glm0 <- glm(response ~ ., family = binomial, german_credit.train)
credit.glm.step <- step(credit.glm0, direction = "backward")
summary(credit.glm.step)
## 
## Call:
## glm(formula = response ~ chk_acct + duration + credit_his + purpose + 
##     amount + saving_acct + installment_rate + sex + other_debtor + 
##     age + other_install + telephone + foreign, family = binomial, 
##     data = german_credit.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2263  -0.7082  -0.3799   0.6923   2.7670  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.608e+00  8.251e-01   1.949 0.051344 .  
## chk_acctA12       -3.628e-01  2.408e-01  -1.507 0.131894    
## chk_acctA13       -9.224e-01  3.882e-01  -2.376 0.017499 *  
## chk_acctA14       -1.942e+00  2.623e-01  -7.404 1.32e-13 ***
## duration           2.516e-02  1.000e-02   2.516 0.011868 *  
## credit_hisA31     -6.266e-01  6.195e-01  -1.012 0.311749    
## credit_hisA32     -1.190e+00  4.761e-01  -2.500 0.012422 *  
## credit_hisA33     -1.118e+00  5.413e-01  -2.065 0.038939 *  
## credit_hisA34     -1.695e+00  5.007e-01  -3.386 0.000710 ***
## purposeA41        -1.437e+00  4.026e-01  -3.568 0.000359 ***
## purposeA410       -2.149e+00  9.172e-01  -2.343 0.019151 *  
## purposeA42        -8.366e-01  2.850e-01  -2.936 0.003327 ** 
## purposeA43        -9.470e-01  2.748e-01  -3.447 0.000567 ***
## purposeA44        -5.495e-01  7.469e-01  -0.736 0.461921    
## purposeA45        -2.052e-01  6.364e-01  -0.322 0.747138    
## purposeA46         2.944e-01  4.299e-01   0.685 0.493510    
## purposeA48        -2.177e+00  1.216e+00  -1.790 0.073463 .  
## purposeA49        -8.948e-01  3.800e-01  -2.355 0.018531 *  
## amount             1.549e-04  4.937e-05   3.137 0.001704 ** 
## saving_acctA62    -5.153e-01  3.222e-01  -1.599 0.109744    
## saving_acctA63     1.053e-01  4.023e-01   0.262 0.793550    
## saving_acctA64    -8.765e-01  5.570e-01  -1.573 0.115607    
## saving_acctA65    -1.057e+00  2.927e-01  -3.613 0.000303 ***
## installment_rate   3.240e-01  9.618e-02   3.368 0.000756 ***
## sexA92             1.701e-01  4.374e-01   0.389 0.697452    
## sexA93            -4.627e-01  4.277e-01  -1.082 0.279346    
## sexA94            -5.573e-02  5.127e-01  -0.109 0.913440    
## other_debtorA102   6.377e-01  4.276e-01   1.491 0.135880    
## other_debtorA103  -1.177e+00  4.610e-01  -2.553 0.010671 *  
## age               -1.534e-02  9.226e-03  -1.663 0.096272 .  
## other_installA142 -4.377e-01  4.941e-01  -0.886 0.375685    
## other_installA143 -8.460e-01  2.578e-01  -3.281 0.001033 ** 
## telephoneA192     -3.358e-01  2.062e-01  -1.629 0.103378    
## foreignA202       -1.249e+00  6.531e-01  -1.912 0.055876 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 977.38  on 799  degrees of freedom
## Residual deviance: 724.74  on 766  degrees of freedom
## AIC: 792.74
## 
## Number of Fisher Scoring iterations: 5

stepwise variable selection using BIC

From stepwise variable selection method using BIC, the significant variables are:

  • chk_acct
  • duration
credit.glm.step.bic <- step(credit.glm0, k = log(nrow(german_credit.train)))
summary(credit.glm.step.bic)
## 
## Call:
## glm(formula = response ~ chk_acct + duration, family = binomial, 
##     data = german_credit.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5984  -0.8548  -0.4809   0.9769   2.3311  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.80900    0.19997  -4.046 5.22e-05 ***
## chk_acctA12 -0.45879    0.19986  -2.296  0.02170 *  
## chk_acctA13 -0.96408    0.35409  -2.723  0.00648 ** 
## chk_acctA14 -2.05961    0.22897  -8.995  < 2e-16 ***
## duration     0.03666    0.00684   5.360 8.34e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 977.38  on 799  degrees of freedom
## Residual deviance: 839.30  on 795  degrees of freedom
## AIC: 849.3
## 
## Number of Fisher Scoring iterations: 4

Chi-square test for significance of variables

Using drop-1 method to check variable importance, we find the significant variables as:

  • chk_acct
  • duration
  • credit_his
  • purpose
  • amount
  • saving_acct
  • installment_rate
  • other_debtor
  • other_install
  • foreign
drop1(credit.glm0, test ="Chi")
## Single term deletions
## 
## Model:
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + present_emp + installment_rate + sex + other_debtor + 
##     present_resid + property + age + other_install + housing + 
##     n_credits + job + n_people + telephone + foreign
##                  Df Deviance    AIC    LRT  Pr(>Chi)    
## <none>                710.25 808.25                     
## chk_acct          3   773.06 865.06 62.810 1.475e-13 ***
## duration          1   717.12 813.12  6.875 0.0087416 ** 
## credit_his        4   726.68 816.68 16.436 0.0024867 ** 
## purpose           9   741.63 821.63 31.387 0.0002541 ***
## amount            1   719.27 815.27  9.022 0.0026669 ** 
## saving_acct       4   725.67 815.67 15.423 0.0038996 ** 
## present_emp       4   716.30 806.30  6.057 0.1949295    
## installment_rate  1   720.37 816.37 10.125 0.0014626 ** 
## sex               3   716.91 808.91  6.663 0.0834359 .  
## other_debtor      2   718.77 812.77  8.522 0.0141106 *  
## present_resid     1   710.76 806.76  0.511 0.4745898    
## property          3   713.78 805.78  3.534 0.3164337    
## age               1   713.13 809.13  2.887 0.0893214 .  
## other_install     2   720.26 814.26 10.019 0.0066732 ** 
## housing           2   712.19 806.19  1.944 0.3783533    
## n_credits         1   711.78 807.78  1.531 0.2159470    
## job               3   711.59 803.59  1.346 0.7182282    
## n_people          1   710.78 806.78  0.533 0.4652324    
## telephone         1   713.09 809.09  2.842 0.0918461 .  
## foreign           1   715.08 811.08  4.838 0.0278409 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Lasso variable selection

To get variable selection using LASSO, we first create matrix of the dataset.

factor_var <- c(1,3,4,6,7,9,10,12,14,15,17,19,20,21)
num_var <- c(2,5,8,11,13,16,18)
train2 <- german_credit.train
train2[num_var] <- scale(train2[num_var])
train2[factor_var] <- sapply(train2[factor_var] , as.numeric)

X.train <- as.matrix(train2[,1:20])
Y.train <- as.matrix(train2[,21])

We fit the LASSO model to our data. From the plot below, we see that as the value of lambda keeps on increasing, the coefficients for the variables tend to 0.

lasso.fit<- glmnet(x=X.train, y=Y.train, family = "binomial", alpha = 1)
plot(lasso.fit, xvar = "lambda", label=TRUE)

Using cross validation to find perfect lambda value

cv.lasso<- cv.glmnet(x=X.train, y=Y.train, family = "binomial", alpha = 1, nfolds = 10)
plot(cv.lasso)

Error associated with model with lambda=1se and coefficients of model

cv.lasso$lambda.1se
## [1] 0.0334437
coef(lasso.fit, s=cv.lasso$lambda.1se)
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)       1.394985809
## chk_acct         -0.466208930
## duration          0.245704522
## credit_his       -0.200888155
## purpose           .          
## amount            0.025455296
## saving_acct      -0.060184929
## present_emp      -0.010413263
## installment_rate  .          
## sex              -0.004068888
## other_debtor      .          
## present_resid     .          
## property          .          
## age               .          
## other_install    -0.099223291
## housing           .          
## n_credits         .          
## job               .          
## n_people          .          
## telephone         .          
## foreign           .

Final model for GLM

For our final model, we select the final variables:

  • chk_acct
  • duration
  • credit_his
  • amount
  • saving_acct
  • installment_rate
  • other_install
credit.glm.final <- glm(response ~ chk_acct + duration + credit_his + amount + saving_acct + other_install + installment_rate, family = binomial, german_credit.train)

Model Evaluation

In-sample misclassification rate

Keeping cutoff as 0.1667, we calculate the misclassification rate:

prob.glm1.insample <- predict(credit.glm.final, type = "response")
predicted.glm1.insample <- prob.glm1.insample > 0.1667
predicted.glm1.insample <- as.numeric(predicted.glm1.insample)
mean(ifelse(german_credit.train$response != predicted.glm1.insample, 1, 0))
## [1] 0.4075

Confusion Matrix

Checking for the predictions and seeing the False Positive and False negative values from the below confusion matrix:

table(german_credit.train$response, predicted.glm1.insample, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth   0   1
##     0 261 299
##     1  27 213

ROC Plot

ROC Plot for the same is plotted below and the AUC is 0.7896875

roc.plot(german_credit.train$response == "1", prob.glm1.insample)
roc.plot(german_credit.train$response == "1", prob.glm1.insample)$roc.vol$Area

## [1] 0.7896875

Out of sample misclassification rate and AUC score

We get a misclassification rate of 0.395, and AUC of 0.7734524

prob.glm1.outsample <- predict(credit.glm.final, german_credit.test, type = "response")
predicted.glm1.outsample <- prob.glm1.outsample > 0.1667
predicted.glm1.outsample <- as.numeric(predicted.glm1.outsample)
table(german_credit.test$response, predicted.glm1.outsample, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth  0  1
##     0 68 72
##     1  7 53
mean(ifelse(german_credit.test$response != predicted.glm1.outsample, 1, 0))
## [1] 0.395
roc.plot(german_credit.test$response == "1", prob.glm1.outsample)
roc.plot(german_credit.test$response == "1", prob.glm1.outsample)$roc.vol$Area

## [1] 0.7734524

Asymmetric misclassification rate giving more penalty for false positives

In cases where we need to penalize the False Negative more than False Positive, we use a 5:1 penalty for misclassification and see an error rate of 0.535

cost1 <- function(r, pi) {
  weight1 = 5
  weight0 = 1
  c1 = (r == 1) & (pi < 0.17)  #logical vector - true if actual 1 but predict 0
  c0 = (r == 0) & (pi > 0.17)  #logical vecotr - true if actual 0 but predict 1
  return(mean(weight1 * c1 + weight0 * c0))
}

cost1(german_credit.test$response,predicted.glm1.outsample)
## [1] 0.535

Continuing in PART 2: