The German credit scoring data is a dataset provided by Prof. Hofmann in the file german.data. It contains information on 1000 individuals, on the basis of which each has been classified as a good or bad credit risk. The variable response in the dataset holds the risk label: in the raw data, 1 corresponds to a good credit record and 2 to a bad one.
In Part 1, we perform Exploratory Data Analysis on the dataset and then build a Logistic Regression Model, including variable selection.
In Part 2, we use CART, Bagging and Random Forest for model building.
We have used the following libraries for our analysis:
library(knitr)
library(dplyr)
library(tidyr)
library(reshape2)
library(RColorBrewer)
library(GGally)
library(ggplot2)
library(caret)
library(glmnet)
library(boot)
library(verification)
We read the data directly from the UCI Machine Learning Repository. We assign names to the columns and recode the response labels to 0 and 1:
0 corresponding to a good credit record and 1 corresponding to a bad one (the positive class).
german_credit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
colnames(german_credit) = c("chk_acct", "duration", "credit_his", "purpose",
"amount", "saving_acct", "present_emp", "installment_rate", "sex", "other_debtor",
"present_resid", "property", "age", "other_install", "housing", "n_credits",
"job", "n_people", "telephone", "foreign", "response")
german_credit$response = german_credit$response - 1
german_credit$response <- as.factor(german_credit$response)
There is a total of 21 attributes in the dataset. Their types and first few values are shown below:
glimpse(german_credit)
## Observations: 1,000
## Variables: 21
## $ chk_acct <fct> A11, A12, A14, A11, A11, A14, A14, A12, A14, ...
## $ duration <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48...
## $ credit_his <fct> A34, A32, A34, A32, A33, A32, A32, A32, A32, ...
## $ purpose <fct> A43, A43, A46, A42, A40, A46, A42, A41, A43, ...
## $ amount <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 694...
## $ saving_acct <fct> A65, A61, A61, A61, A61, A65, A63, A61, A64, ...
## $ present_emp <fct> A75, A73, A74, A74, A73, A73, A75, A73, A74, ...
## $ installment_rate <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, ...
## $ sex <fct> A93, A92, A93, A93, A93, A93, A93, A93, A91, ...
## $ other_debtor <fct> A101, A101, A101, A103, A101, A101, A101, A10...
## $ present_resid <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, ...
## $ property <fct> A121, A121, A121, A122, A124, A124, A122, A12...
## $ age <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 2...
## $ other_install <fct> A143, A143, A143, A143, A143, A143, A143, A14...
## $ housing <fct> A152, A152, A152, A153, A153, A153, A152, A15...
## $ n_credits <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, ...
## $ job <fct> A173, A173, A172, A173, A173, A172, A173, A17...
## $ n_people <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ telephone <fct> A192, A191, A191, A191, A191, A192, A191, A19...
## $ foreign <fct> A201, A201, A201, A201, A201, A201, A201, A20...
## $ response <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, ...
We take the summary statistics of the dataset. It has a total of 1000 observations of 21 variables, of which 7 are numerical and 14 are categorical with various levels (the response, originally numeric, is now a factor). The summary statistics for the variables are presented below.
summary(german_credit)
## chk_acct duration credit_his purpose amount
## A11:274 Min. : 4.0 A30: 40 A43 :280 Min. : 250
## A12:269 1st Qu.:12.0 A31: 49 A40 :234 1st Qu.: 1366
## A13: 63 Median :18.0 A32:530 A42 :181 Median : 2320
## A14:394 Mean :20.9 A33: 88 A41 :103 Mean : 3271
## 3rd Qu.:24.0 A34:293 A49 : 97 3rd Qu.: 3972
## Max. :72.0 A46 : 50 Max. :18424
## (Other): 55
## saving_acct present_emp installment_rate sex other_debtor
## A61:603 A71: 62 Min. :1.000 A91: 50 A101:907
## A62:103 A72:172 1st Qu.:2.000 A92:310 A102: 41
## A63: 63 A73:339 Median :3.000 A93:548 A103: 52
## A64: 48 A74:174 Mean :2.973 A94: 92
## A65:183 A75:253 3rd Qu.:4.000
## Max. :4.000
##
## present_resid property age other_install housing
## Min. :1.000 A121:282 Min. :19.00 A141:139 A151:179
## 1st Qu.:2.000 A122:232 1st Qu.:27.00 A142: 47 A152:713
## Median :3.000 A123:332 Median :33.00 A143:814 A153:108
## Mean :2.845 A124:154 Mean :35.55
## 3rd Qu.:4.000 3rd Qu.:42.00
## Max. :4.000 Max. :75.00
##
## n_credits job n_people telephone foreign response
## Min. :1.000 A171: 22 Min. :1.000 A191:596 A201:963 0:700
## 1st Qu.:1.000 A172:200 1st Qu.:1.000 A192:404 A202: 37 1:300
## Median :1.000 A173:630 Median :1.000
## Mean :1.407 A174:148 Mean :1.155
## 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :4.000 Max. :2.000
##
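As a quick check of the variable types, we can tabulate the column classes:
# count how many columns are integers vs. factors (response is a factor after recoding)
table(sapply(german_credit, class))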
Our EDA of the continuous variables is summarised in the plots below:
# class-wise means of amount and duration, used for the dashed reference lines below
amount.mean = german_credit %>% dplyr::select(amount, response) %>% group_by(response) %>% summarise(m = mean(amount))
duration.mean = german_credit %>% dplyr::select(duration, response) %>% group_by(response) %>% summarise(m = mean(duration))
ggplot(german_credit, aes(duration, fill=response)) +
geom_density(alpha=.5) + geom_vline(data=duration.mean, aes(xintercept=m, colour=response),
linetype="dashed", size=1)
test.m = german_credit[,c(2,5,8,13,16,18,21)]
test.m$response <- as.numeric(test.m$response)
ggplot(melt(german_credit[,c(2,21)]), aes(x = variable, y = value, fill = response)) + geom_boxplot() + xlab("response") + ylab("duration")
ggplot(german_credit, aes(factor(installment_rate), ..count..)) +
geom_bar(aes(fill = response), position = "dodge") + xlab("Installment Rates")
ggplot(german_credit, aes(amount, fill=response)) +
geom_density(alpha=.5) + geom_vline(data=amount.mean, aes(xintercept=m, colour=response),
linetype="dashed", size=1)
ggplot(melt(german_credit[,c(5,21)]), aes(x = variable, y = value, fill = response)) +
geom_boxplot() + xlab("response") + ylab("amount")
ggplot(melt(german_credit[,c(13,21)]), aes(x = variable, y = value, fill = response)) +
geom_boxplot()+ xlab("response") + ylab("age")
ggplot(melt(german_credit[,c(16,21)]), aes(x = variable, y = value, fill = response)) +
geom_boxplot()
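As a numeric complement to the plots, the class-wise means of the main continuous variables can be tabulated directly (a short sketch):
# mean duration, amount and age for good (0) vs. bad (1) credit records
german_credit %>% group_by(response) %>%
summarise(mean_duration = mean(duration), mean_amount = mean(amount), mean_age = mean(age))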
Our EDA of the categorical variables is summarised in the plots below:
ggplot(german_credit, aes(chk_acct, ..count..)) +
geom_bar(aes(fill = response), position = "dodge")
ggplot(german_credit, aes(credit_his, ..count..)) +
geom_bar(aes(fill = response), position = "dodge")
ggplot(german_credit, aes(purpose, ..count..)) +
geom_bar(aes(fill = response), position = "dodge")
ggplot(german_credit, aes(saving_acct, ..count..)) +
geom_bar(aes(fill = response), position = "dodge")
ggplot(german_credit, aes(other_debtor, ..count..)) +
geom_bar(aes(fill = response), position = "dodge")
ggplot(german_credit, aes(sex, ..count..)) +
geom_bar(aes(fill = response), position = "dodge")
ggplot(german_credit, aes(other_install, ..count..)) +
geom_bar(aes(fill = response), position = "dodge")
ggplot(german_credit, aes(foreign, ..count..)) +
geom_bar(aes(fill = response), position = "dodge")
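The counts in these bar charts can also be expressed as within-level proportions, for example for the checking-account status (a short sketch):
# share of good (0) vs. bad (1) records within each chk_acct level
prop.table(table(german_credit$chk_acct, german_credit$response), margin = 1)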
We now build a logistic regression model to predict riskiness.
We split our data into an 80:20 train-test split using stratified sampling, so that both sets preserve the original class proportions:
set.seed(12420424)
in.train <- createDataPartition(as.factor(german_credit$response), p=0.8, list=FALSE)
german_credit.train <- german_credit[in.train,]
german_credit.test <- german_credit[-in.train,]
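A quick check confirms that the stratified split preserves the roughly 70:30 good/bad ratio in both sets:
# class proportions in the training and test sets
prop.table(table(german_credit.train$response))
prop.table(table(german_credit.test$response))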
Using backward stepwise variable selection with AIC, the variables retained in the model are:
credit.glm0 <- glm(response ~ ., family = binomial, german_credit.train)
credit.glm.step <- step(credit.glm0, direction = "backward")
summary(credit.glm.step)
##
## Call:
## glm(formula = response ~ chk_acct + duration + credit_his + purpose +
## amount + saving_acct + installment_rate + sex + other_debtor +
## age + other_install + telephone + foreign, family = binomial,
## data = german_credit.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2263 -0.7082 -0.3799 0.6923 2.7670
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.608e+00 8.251e-01 1.949 0.051344 .
## chk_acctA12 -3.628e-01 2.408e-01 -1.507 0.131894
## chk_acctA13 -9.224e-01 3.882e-01 -2.376 0.017499 *
## chk_acctA14 -1.942e+00 2.623e-01 -7.404 1.32e-13 ***
## duration 2.516e-02 1.000e-02 2.516 0.011868 *
## credit_hisA31 -6.266e-01 6.195e-01 -1.012 0.311749
## credit_hisA32 -1.190e+00 4.761e-01 -2.500 0.012422 *
## credit_hisA33 -1.118e+00 5.413e-01 -2.065 0.038939 *
## credit_hisA34 -1.695e+00 5.007e-01 -3.386 0.000710 ***
## purposeA41 -1.437e+00 4.026e-01 -3.568 0.000359 ***
## purposeA410 -2.149e+00 9.172e-01 -2.343 0.019151 *
## purposeA42 -8.366e-01 2.850e-01 -2.936 0.003327 **
## purposeA43 -9.470e-01 2.748e-01 -3.447 0.000567 ***
## purposeA44 -5.495e-01 7.469e-01 -0.736 0.461921
## purposeA45 -2.052e-01 6.364e-01 -0.322 0.747138
## purposeA46 2.944e-01 4.299e-01 0.685 0.493510
## purposeA48 -2.177e+00 1.216e+00 -1.790 0.073463 .
## purposeA49 -8.948e-01 3.800e-01 -2.355 0.018531 *
## amount 1.549e-04 4.937e-05 3.137 0.001704 **
## saving_acctA62 -5.153e-01 3.222e-01 -1.599 0.109744
## saving_acctA63 1.053e-01 4.023e-01 0.262 0.793550
## saving_acctA64 -8.765e-01 5.570e-01 -1.573 0.115607
## saving_acctA65 -1.057e+00 2.927e-01 -3.613 0.000303 ***
## installment_rate 3.240e-01 9.618e-02 3.368 0.000756 ***
## sexA92 1.701e-01 4.374e-01 0.389 0.697452
## sexA93 -4.627e-01 4.277e-01 -1.082 0.279346
## sexA94 -5.573e-02 5.127e-01 -0.109 0.913440
## other_debtorA102 6.377e-01 4.276e-01 1.491 0.135880
## other_debtorA103 -1.177e+00 4.610e-01 -2.553 0.010671 *
## age -1.534e-02 9.226e-03 -1.663 0.096272 .
## other_installA142 -4.377e-01 4.941e-01 -0.886 0.375685
## other_installA143 -8.460e-01 2.578e-01 -3.281 0.001033 **
## telephoneA192 -3.358e-01 2.062e-01 -1.629 0.103378
## foreignA202 -1.249e+00 6.531e-01 -1.912 0.055876 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 977.38 on 799 degrees of freedom
## Residual deviance: 724.74 on 766 degrees of freedom
## AIC: 792.74
##
## Number of Fisher Scoring iterations: 5
Using stepwise variable selection with the BIC penalty (k = log(n)), the variables retained are:
credit.glm.step.bic <- step(credit.glm0, k = log(nrow(german_credit.train)))
summary(credit.glm.step.bic)
##
## Call:
## glm(formula = response ~ chk_acct + duration, family = binomial,
## data = german_credit.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5984 -0.8548 -0.4809 0.9769 2.3311
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.80900 0.19997 -4.046 5.22e-05 ***
## chk_acctA12 -0.45879 0.19986 -2.296 0.02170 *
## chk_acctA13 -0.96408 0.35409 -2.723 0.00648 **
## chk_acctA14 -2.05961 0.22897 -8.995 < 2e-16 ***
## duration 0.03666 0.00684 5.360 8.34e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 977.38 on 799 degrees of freedom
## Residual deviance: 839.30 on 795 degrees of freedom
## AIC: 849.3
##
## Number of Fisher Scoring iterations: 4
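As a quick side-by-side comparison of the two stepwise models, base R's AIC() and BIC() can be applied to both fits (a small sketch):
# information criteria for the AIC-selected and BIC-selected models
AIC(credit.glm.step, credit.glm.step.bic)
BIC(credit.glm.step, credit.glm.step.bic)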
Using the drop-one (likelihood ratio chi-square) test on the full model to check variable importance, the significant variables are:
drop1(credit.glm0, test ="Chi")
## Single term deletions
##
## Model:
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + present_emp + installment_rate + sex + other_debtor +
## present_resid + property + age + other_install + housing +
## n_credits + job + n_people + telephone + foreign
## Df Deviance AIC LRT Pr(>Chi)
## <none> 710.25 808.25
## chk_acct 3 773.06 865.06 62.810 1.475e-13 ***
## duration 1 717.12 813.12 6.875 0.0087416 **
## credit_his 4 726.68 816.68 16.436 0.0024867 **
## purpose 9 741.63 821.63 31.387 0.0002541 ***
## amount 1 719.27 815.27 9.022 0.0026669 **
## saving_acct 4 725.67 815.67 15.423 0.0038996 **
## present_emp 4 716.30 806.30 6.057 0.1949295
## installment_rate 1 720.37 816.37 10.125 0.0014626 **
## sex 3 716.91 808.91 6.663 0.0834359 .
## other_debtor 2 718.77 812.77 8.522 0.0141106 *
## present_resid 1 710.76 806.76 0.511 0.4745898
## property 3 713.78 805.78 3.534 0.3164337
## age 1 713.13 809.13 2.887 0.0893214 .
## other_install 2 720.26 814.26 10.019 0.0066732 **
## housing 2 712.19 806.19 1.944 0.3783533
## n_credits 1 711.78 807.78 1.531 0.2159470
## job 3 711.59 803.59 1.346 0.7182282
## n_people 1 710.78 806.78 0.533 0.4652324
## telephone 1 713.09 809.09 2.842 0.0918461 .
## foreign 1 715.08 811.08 4.838 0.0278409 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For variable selection using LASSO, we first convert the training data into numeric matrices, scaling the numeric variables and coding the factors as integers:
factor_var <- c(1,3,4,6,7,9,10,12,14,15,17,19,20,21)
num_var <- c(2,5,8,11,13,16,18)
train2 <- german_credit.train
train2[num_var] <- scale(train2[num_var])
train2[factor_var] <- sapply(train2[factor_var] , as.numeric)
X.train <- as.matrix(train2[,1:20])
Y.train <- as.matrix(train2[,21])
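Note that coding the factors as integers imposes an artificial ordering on their levels. An alternative, not used here, is a dummy-coded design matrix built with model.matrix (a small sketch):
# one dummy column per factor level, dropping the intercept column
X.train.dummy <- model.matrix(response ~ ., data = german_credit.train)[, -1]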
We fit the LASSO model to our data. From the plot below, we see that as lambda increases, the coefficients shrink towards 0.
lasso.fit<- glmnet(x=X.train, y=Y.train, family = "binomial", alpha = 1)
plot(lasso.fit, xvar = "lambda", label=TRUE)
We use 10-fold cross-validation to find a suitable value of lambda:
cv.lasso<- cv.glmnet(x=X.train, y=Y.train, family = "binomial", alpha = 1, nfolds = 10)
plot(cv.lasso)
Below are the lambda value one standard error above the minimum (lambda.1se) and the coefficients of the corresponding model:
cv.lasso$lambda.1se
## [1] 0.0334437
coef(lasso.fit, s=cv.lasso$lambda.1se)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.394985809
## chk_acct -0.466208930
## duration 0.245704522
## credit_his -0.200888155
## purpose .
## amount 0.025455296
## saving_acct -0.060184929
## present_emp -0.010413263
## installment_rate .
## sex -0.004068888
## other_debtor .
## present_resid .
## property .
## age .
## other_install -0.099223291
## housing .
## n_credits .
## job .
## n_people .
## telephone .
## foreign .
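The variables LASSO retains at lambda.1se can be listed directly (a small sketch):
# names of the nonzero coefficients at lambda.1se
coefs <- as.matrix(coef(lasso.fit, s = cv.lasso$lambda.1se))
rownames(coefs)[coefs != 0]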
Combining the results of the selection methods above, we select the following variables for our final model:
credit.glm.final <- glm(response ~ chk_acct + duration + credit_his + amount + saving_acct + other_install + installment_rate, family = binomial, german_credit.train)
Keeping the cutoff at 0.1667, we calculate the in-sample misclassification rate. (As sketched below, 0.1667 = 1/6 is the cost-optimal cutoff under the 5:1 misclassification penalty applied at the end of this analysis.)
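A minimal sketch of where this cutoff comes from, assuming false negatives are 5 times as costly as false positives:
# Bayes-optimal cutoff: predict bad (1) when p > C_FP / (C_FP + C_FN)
cutoff <- 1 / (1 + 5)   # = 1/6, approximately 0.1667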
prob.glm1.insample <- predict(credit.glm.final, type = "response")
predicted.glm1.insample <- prob.glm1.insample > 0.1667
predicted.glm1.insample <- as.numeric(predicted.glm1.insample)
mean(ifelse(german_credit.train$response != predicted.glm1.insample, 1, 0))
## [1] 0.4075
We check the predictions and read the false positive and false negative counts from the confusion matrix below:
table(german_credit.train$response, predicted.glm1.insample, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 261 299
## 1 27 213
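From these counts, the false positive and false negative rates can be read off directly (a small sketch based on the table above):
# in-sample error rates from the confusion matrix
conf <- table(german_credit.train$response, predicted.glm1.insample)
conf["0", "1"] / sum(conf["0", ])   # false positive rate: 299 / 560
conf["1", "0"] / sum(conf["1", ])   # false negative rate: 27 / 240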
The in-sample ROC curve is plotted below; the AUC is 0.7896875.
roc.plot(german_credit.train$response == "1", prob.glm1.insample)
roc.plot(german_credit.train$response == "1", prob.glm1.insample)$roc.vol$Area
## [1] 0.7896875
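The same AUC can be obtained without drawing the plot, using roc.area() from the verification package (a small sketch):
# roc.area expects a 0/1 observation vector and the predicted probabilities
roc.area(as.numeric(german_credit.train$response == "1"), prob.glm1.insample)$A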
Out of sample, we get a misclassification rate of 0.395 and an AUC of 0.7734524:
prob.glm1.outsample <- predict(credit.glm.final, german_credit.test, type = "response")
predicted.glm1.outsample <- prob.glm1.outsample > 0.1667
predicted.glm1.outsample <- as.numeric(predicted.glm1.outsample)
table(german_credit.test$response, predicted.glm1.outsample, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 68 72
## 1 7 53
mean(ifelse(german_credit.test$response != predicted.glm1.outsample, 1, 0))
## [1] 0.395
roc.plot(german_credit.test$response == "1", prob.glm1.outsample)
roc.plot(german_credit.test$response == "1", prob.glm1.outsample)$roc.vol$Area
## [1] 0.7734524
In cases where a false negative (labelling a bad credit record as good) is more costly than a false positive, we apply a 5:1 misclassification penalty and obtain an expected cost of 0.535:
cost1 <- function(r, pi) {
weight1 = 5
weight0 = 1
c1 = (r == 1) & (pi < 0.17) # logical vector - true if actual 1 but predicted 0 (false negative)
c0 = (r == 0) & (pi > 0.17) # logical vector - true if actual 0 but predicted 1 (false positive)
return(mean(weight1 * c1 + weight0 * c0))
}
cost1(german_credit.test$response,predicted.glm1.outsample)
## [1] 0.535
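Because the cost function thresholds its second argument at 0.17, it can also be given the raw predicted probabilities instead of the 0/1 labels (a small sketch; the result may differ marginally from 0.535 if any probabilities fall between the 0.1667 and 0.17 cutoffs):
cost1(german_credit.test$response, prob.glm1.outsample)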