The German Credit Scoring data set contains classification of customers into good credit risk and bad risk based on several factors. The dataset contains 1000 observations of 21 variables. This dataset contains different factors which might contribute to a customer being classified as posing risk on a credit taken from the bank.There are 20 predictors into consideration to predict the class of the variable ‘response’.
set.seed(12383117)
library(verification)
library(dplyr)
library(tidyr)
library(MASS)
library(tidyverse)
library(ggcorrplot)
library(ggpubr)
german_credit <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
colnames(german_credit) <- c("chk_acct", "duration", "credit_his", "purpose",
"amount", "saving_acct", "present_emp", "installment_rate", "sex", "other_debtor",
"present_resid", "property", "age", "other_install", "housing", "n_credits",
"job", "n_people", "telephone", "foreign", "response")
# orginal response coding 1= good, 2 = bad we need 0 = good, 1 = bad
german_credit$response = german_credit$response - 1
We first check the summary statistics for the important variables.
summary(german_credit)
## chk_acct duration credit_his purpose amount
## A11:274 Min. : 4.0 A30: 40 A43 :280 Min. : 250
## A12:269 1st Qu.:12.0 A31: 49 A40 :234 1st Qu.: 1366
## A13: 63 Median :18.0 A32:530 A42 :181 Median : 2320
## A14:394 Mean :20.9 A33: 88 A41 :103 Mean : 3271
## 3rd Qu.:24.0 A34:293 A49 : 97 3rd Qu.: 3972
## Max. :72.0 A46 : 50 Max. :18424
## (Other): 55
## saving_acct present_emp installment_rate sex other_debtor
## A61:603 A71: 62 Min. :1.000 A91: 50 A101:907
## A62:103 A72:172 1st Qu.:2.000 A92:310 A102: 41
## A63: 63 A73:339 Median :3.000 A93:548 A103: 52
## A64: 48 A74:174 Mean :2.973 A94: 92
## A65:183 A75:253 3rd Qu.:4.000
## Max. :4.000
##
## present_resid property age other_install housing
## Min. :1.000 A121:282 Min. :19.00 A141:139 A151:179
## 1st Qu.:2.000 A122:232 1st Qu.:27.00 A142: 47 A152:713
## Median :3.000 A123:332 Median :33.00 A143:814 A153:108
## Mean :2.845 A124:154 Mean :35.55
## 3rd Qu.:4.000 3rd Qu.:42.00
## Max. :4.000 Max. :75.00
##
## n_credits job n_people telephone foreign
## Min. :1.000 A171: 22 Min. :1.000 A191:596 A201:963
## 1st Qu.:1.000 A172:200 1st Qu.:1.000 A192:404 A202: 37
## Median :1.000 A173:630 Median :1.000
## Mean :1.407 A174:148 Mean :1.155
## 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :4.000 Max. :2.000
##
## response
## Min. :0.0
## 1st Qu.:0.0
## Median :0.0
## Mean :0.3
## 3rd Qu.:1.0
## Max. :1.0
##
There are 13 categorical variables in the dataset.There are no missing values. The variable ‘response’ takes binary values - 0 for low risk(good) and 1 for high risk(bad).
We are also interested in checking the correlation between important variables and response.
factors.credit <- c(1,3,4,6,7,9,10,12,14,15,17,19,20)
german_credit[,factors.credit] <- lapply(german_credit[,factors.credit], factor)
num.credit <- c(2,5,8,11,13,16,18,21)
german_credit[num.credit] <- sapply(german_credit[num.credit], as.numeric)
cor.credit <- cor(german_credit[num.credit])
ggcorrplot(cor.credit,hc.order = TRUE,lab=TRUE) + ggtitle("Correlation between numeric variables") + theme(axis.text.x = element_text(angle = 90, vjust=0.5))
We can see that variables amount and duration have a positive correlation with response, while age impacts it negatively. Hence higher amount and duration imply the value of response will be be higher i.e. closer to 1. Hence, it increases the chance of high credit risk i.e. bad customer.
We now split our data set into two parts - one with customers who have been classified as good, and one with customers classified as bad. The data sets contain 700 and 300 observations respectively. The distribution of variables for each set is also checked.
#Boxplot for age
ggplot(german_credit,aes(x=as.factor(response),y=age))+ geom_boxplot(fill="blue") + labs(title='Age by risk classification',x='Risk Classification(1 for bad risk, 0 for good risk)',y='Age(in years)')
We can see that median age for people classified as bad is lower. Hence, younger people have more chance of being a bad customer and pose higher credit risk.
#Boxplot for amount
ggplot(german_credit,aes(x=as.factor(response),y=amount))+ geom_boxplot(fill="green") + labs(title='Credit Amount by risk classification',x='Risk Classification(1 for bad risk, 0 for good risk)',y='Credit Amount')
We can see that median credit amount for bad customers is slightly higher. This indicates that people who have higher credit amounts might be higher risk customers.
#Boxplot for duration
ggplot(german_credit,aes(x=as.factor(response),y=duration))+ geom_boxplot(fill="brown") + labs(title='Duration of credit by risk classification',x='Risk Classification(1 for bad risk, 0 for good risk)',y='Duration of credit')
We can see that customers classified as bad have higher median duration of credit.
We first split our dataset into training and test data set. We randomly select 80% of the observations as a part of the training data set and 20% of the observations as the test data set. The model will be built on the training data set and will be tested on the test data set.
#Create testing and training data set
subset <- sample(nrow(german_credit), nrow(german_credit) * 0.8)
credit.train <- german_credit[subset, ]
credit.test <- german_credit[-subset, ]
Since the response variable is qualitative i.e. we classify the customer as Good or Bad based on Credit Risk, we use the Logistic Regression technique for model creation.
We build two models. First model is the null model containing only the response variable and no predictors. The second model is the full model containing all 20 predictors.
Now we select the best model using the stepwise selection method - with AIC criterion.
credit.glm0 <- glm(response ~ ., family = "binomial", credit.train)
null <- glm(response~1,family = "binomial",credit.train)
full <- glm(response~.,family = "binomial",credit.train)
lm.AIC <- stepAIC(full,scope=list(lower=null,upper=full),k=2,direction = "both")
summary(lm.AIC)
##
## Call:
## glm(formula = response ~ chk_acct + duration + credit_his + purpose +
## amount + saving_acct + installment_rate + sex + other_debtor +
## other_install + housing + n_credits + telephone + foreign,
## family = "binomial", data = credit.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2517 -0.6770 -0.3673 0.6500 2.9525
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.444e-01 8.597e-01 0.517 0.605206
## chk_acctA12 -5.669e-01 2.469e-01 -2.296 0.021682 *
## chk_acctA13 -1.256e+00 4.167e-01 -3.015 0.002572 **
## chk_acctA14 -1.863e+00 2.574e-01 -7.238 4.57e-13 ***
## duration 2.462e-02 1.025e-02 2.402 0.016317 *
## credit_hisA31 1.951e-01 6.141e-01 0.318 0.750761
## credit_hisA32 -5.820e-01 4.787e-01 -1.216 0.224093
## credit_hisA33 -6.401e-01 5.240e-01 -1.221 0.221934
## credit_hisA34 -1.611e+00 4.844e-01 -3.326 0.000881 ***
## purposeA41 -1.786e+00 4.249e-01 -4.202 2.64e-05 ***
## purposeA410 -8.062e-01 8.125e-01 -0.992 0.321088
## purposeA42 -6.379e-01 2.856e-01 -2.233 0.025517 *
## purposeA43 -9.070e-01 2.807e-01 -3.231 0.001235 **
## purposeA44 -5.604e-01 8.712e-01 -0.643 0.520073
## purposeA45 -3.518e-01 6.582e-01 -0.535 0.592956
## purposeA46 2.195e-01 4.529e-01 0.485 0.627925
## purposeA48 -2.322e+00 1.253e+00 -1.853 0.063853 .
## purposeA49 -8.577e-01 3.791e-01 -2.263 0.023655 *
## amount 1.645e-04 4.875e-05 3.375 0.000738 ***
## saving_acctA62 -2.122e-01 3.172e-01 -0.669 0.503423
## saving_acctA63 -1.187e-01 4.246e-01 -0.280 0.779800
## saving_acctA64 -1.459e+00 5.789e-01 -2.520 0.011740 *
## saving_acctA65 -1.007e+00 2.939e-01 -3.426 0.000613 ***
## installment_rate 4.063e-01 9.829e-02 4.133 3.57e-05 ***
## sexA92 -1.406e-01 4.105e-01 -0.343 0.731931
## sexA93 -8.778e-01 4.028e-01 -2.179 0.029326 *
## sexA94 -3.046e-01 4.801e-01 -0.634 0.525783
## other_debtorA102 6.940e-01 4.291e-01 1.617 0.105834
## other_debtorA103 -1.628e+00 5.564e-01 -2.926 0.003429 **
## other_installA142 -3.622e-02 4.487e-01 -0.081 0.935662
## other_installA143 -6.043e-01 2.676e-01 -2.258 0.023942 *
## housingA152 -4.778e-01 2.560e-01 -1.866 0.062018 .
## housingA153 -6.258e-01 3.811e-01 -1.642 0.100558
## n_credits 3.105e-01 2.134e-01 1.455 0.145640
## telephoneA192 -3.195e-01 2.136e-01 -1.496 0.134639
## foreignA202 -1.066e+00 6.711e-01 -1.589 0.112125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 973.97 on 799 degrees of freedom
## Residual deviance: 704.81 on 764 degrees of freedom
## AIC: 776.81
##
## Number of Fisher Scoring iterations: 5
The model with AIC criterion for selection has 14 predictors. It drops the following variables - present_emp, present_resid, property, age, job, and people.
Using the model generated through logistic regression, we now plot the ROC curve (true positives vs. false positives) using the training data and find the Area Under the curve (AUC ). Higher value of AUC will indicate better model performance.
AIC.mod <- glm(formula = response ~ chk_acct + duration + credit_his + purpose +
amount + saving_acct + present_emp + installment_rate + sex +
other_debtor + property + age + other_install + n_credits +
foreign, family = "binomial", data = credit.train)
#ROC curve
in.prob <- predict(AIC.mod,type="response")
roc.plot(credit.train$response == "1", in.prob)$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8417043 4.067723e-53 NA
For the training dataset, the AUC value comes out to be 0.8417.
For the misclassification rate, we find an optimal cut-off probability derived using the asymmetric cost function (weight for high credit risk is chosen as 5 compared to 1 for low credit risk). The cost function implies that the wrongly classifying customers that are supposed to be classified as Bad costs 5 times than wrongly classifying customers supposed to be classified as Good. We plot the cost function in training set against the cut-off probability.
##Cost function
# define the searc grid from 0.01 to 0.99
searchgrid = seq(0.01, 0.99, 0.01)
# result is a 99x2 matrix, the 1st col stores the cut-off p, the 2nd column
# stores the cost
result <- cbind(searchgrid, NA)
# in the cost function, both r and pi are vectors, r=truth, pi=predicted
# probability
cost1 <- function(r, pi) {
weight1 = 5
weight0 = 1
c1 = (r == 1) & (pi < pcut) #logical vector - true if actual 1 but predict 0
c0 = (r == 0) & (pi > pcut) #logical vecotr - true if actual 0 but predict 1
return(mean(weight1 * c1 + weight0 * c0))
}
for (i in 1:length(searchgrid)) {
pcut <- result[i, 1]
# assign the cost to the 2nd col
result[i, 2] <- cost1(credit.train$response, predict(AIC.mod, type = "response"))
}
plot(result, ylab = "Cost in Training Set")
pc <- result[which(result[,2] == min(result[,2])),]
pc[1]
## searchgrid
## 0.16
The value of cut-off probability comes out to be 0.16.
We now calculate the misclassification rate using this value.
#In sample misclassification rate
credit.train$pred <- ifelse(in.prob>pc[1],1,0)
table(pred = credit.train$pred , true = credit.train$response)
## true
## pred 0 1
## 0 319 18
## 1 243 220
mean(credit.train$pred != credit.train$response)
## [1] 0.32625
The misclassification rate comes out to be 0.3265. This implies the model created using the training data is around 67.35% accurate in predicting the customer’s classification.
We now test the model created using the training data set on the test data set containing 20% of the observations. The test data set contains 800 observations of 21 variables. We will find the AUC and the asymmetric misclassification rate for this dataset as well.
We plot the ROC curve for the test data set.
model <- glm(formula = response ~ chk_acct + duration + credit_his + purpose +
amount + saving_acct + present_emp + installment_rate + sex +
other_debtor + property + age + other_install + n_credits +
foreign, family = "binomial", data = credit.test)
out.prob <- predict(model,type="response")
#ROC curve
roc.plot(credit.test$response == "1", out.prob)$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8699158 3.158491e-17 NA
The value of AUC comes out to be 0.8699.
We now calculate the asymmetric classification rate for the test data set. It is to be noted that we will be using the value of cut-off probability as 0.16(the value we had derived using the cost function).
credit.test$pred <- ifelse(out.prob>0.16,1,0)
table(pred = credit.test$pred , true = credit.test$response)
## true
## pred 0 1
## 0 79 5
## 1 59 57
mean(credit.test$pred != credit.test$response)
## [1] 0.32
The value of asymmetric misclassification rate comes out to be 0.32. This means our model is around 68% accurate in predicting the credit risk classification for customers in the test data set.
The final model created for classifying the customers based on credit risk using logistics regression contains 14 predictors. The in-sample and out-of-sample performance of this model is comparable. The model can accurately predict the credit risk classification for more than 67% of the customers in the data set.