Content:
It is almost impossible to understand the original dataset due to its complicated system of categories and symbols. Thus, I wrote a small Python script to convert it into a readable CSV file. Several columns are simply ignored, because in my opinion either they are not important or their descriptions are obscure. The selected attributes are:
Age (numeric) Sex (text: male, female) Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled) Housing (text: own, rent, or free) Saving accounts (text - little, moderate, quite rich, rich) Checking account (numeric, in DM - Deutsch Mark) Credit amount (numeric, in DM) Duration (numeric, in month) Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
# Packages used
require(mlr)
require(ggplot2)
require(dplyr)
require(ggcorrplot)
require(caret)
# Reading the database
df = read.csv('german_credit_risk_target.csv')
head(df, n = 10)
## X Age Sex Job Housing Saving.accounts Checking.account Credit.amount
## 1 0 67 male 2 own <NA> little 1169
## 2 1 22 female 2 own little moderate 5951
## 3 2 49 male 1 own little <NA> 2096
## 4 3 45 male 2 free little little 7882
## 5 4 53 male 2 free little little 4870
## 6 5 35 male 1 free <NA> <NA> 9055
## 7 6 53 male 2 own quite rich <NA> 2835
## 8 7 35 male 3 rent little moderate 6948
## 9 8 61 male 1 own rich <NA> 3059
## 10 9 28 male 3 own little moderate 5234
## Duration Purpose Risk
## 1 6 radio/TV good
## 2 48 radio/TV bad
## 3 12 education good
## 4 42 furniture/equipment good
## 5 24 car bad
## 6 36 education good
## 7 24 furniture/equipment good
## 8 36 car good
## 9 12 radio/TV good
## 10 30 car bad
Getting some information about the variables
str(df)
## 'data.frame': 1000 obs. of 11 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
## $ Job : int 2 2 1 2 2 1 2 3 1 3 ...
## $ Housing : Factor w/ 3 levels "free","own","rent": 2 2 2 1 1 1 2 3 2 2 ...
## $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: NA 1 1 1 1 NA 3 1 4 1 ...
## $ Checking.account: Factor w/ 3 levels "little","moderate",..: 1 2 NA 1 1 NA NA 2 NA 2 ...
## $ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Purpose : Factor w/ 8 levels "business","car",..: 6 6 4 5 2 4 5 2 6 2 ...
## $ Risk : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...
Look! The variable ‘Job’ is an int, but should be Factor. So we’re gonna turn it into Factor using the description offer by who posted the dataset on Kaggle.
df$Job = factor(df$Job, levels = c(0, 1, 2, 3), labels = c('unskilled and non-resident',
'unskilled and resident', 'skilled',
'highly skilled'))
str(df)
## 'data.frame': 1000 obs. of 11 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
## $ Job : Factor w/ 4 levels "unskilled and non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ Housing : Factor w/ 3 levels "free","own","rent": 2 2 2 1 1 1 2 3 2 2 ...
## $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: NA 1 1 1 1 NA 3 1 4 1 ...
## $ Checking.account: Factor w/ 3 levels "little","moderate",..: 1 2 NA 1 1 NA NA 2 NA 2 ...
## $ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Purpose : Factor w/ 8 levels "business","car",..: 6 6 4 5 2 4 5 2 6 2 ...
## $ Risk : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...
Besides, we can see that ‘Saving.accounts’ and ‘Checking.account’ have soma NA’s, but we don’t know if those are the only one’s. Let’s check!
Verifying if there is any other variable with NA’s:
apply(df, 2, function(x) any(is.na(x)))
## X Age Sex Job
## FALSE FALSE FALSE FALSE
## Housing Saving.accounts Checking.account Credit.amount
## FALSE TRUE TRUE FALSE
## Duration Purpose Risk
## FALSE FALSE FALSE
See that the ‘Saving.accounts’ and ‘Checking.account’ columns are the only ones with NA’s, so we’re going to use a method to impute values on thoses NA’s values.
set.seed(100)
imputer = mlr::impute(df, target = "Risk",
cols = list(Saving.accounts = imputeLearner("classif.rpart"),
Checking.account = imputeLearner("classif.rpart")))
df = imputer$data
head(df, n = 10)
## X Age Sex Job Housing Saving.accounts Checking.account
## 1 0 67 male skilled own little little
## 2 1 22 female skilled own little moderate
## 3 2 49 male unskilled and resident own little little
## 4 3 45 male skilled free little little
## 5 4 53 male skilled free little little
## 6 5 35 male unskilled and resident free little little
## 7 6 53 male skilled own quite rich little
## 8 7 35 male highly skilled rent little moderate
## 9 8 61 male unskilled and resident own rich moderate
## 10 9 28 male highly skilled own little moderate
## Credit.amount Duration Purpose Risk
## 1 1169 6 radio/TV good
## 2 5951 48 radio/TV bad
## 3 2096 12 education good
## 4 7882 42 furniture/equipment good
## 5 4870 24 car bad
## 6 9055 36 education good
## 7 2835 24 furniture/equipment good
## 8 6948 36 car good
## 9 3059 12 radio/TV good
## 10 5234 30 car bad
str(df)
## 'data.frame': 1000 obs. of 11 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ...
## $ Job : Factor w/ 4 levels "unskilled and non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ Housing : Factor w/ 3 levels "free","own","rent": 2 2 2 1 1 1 2 3 2 2 ...
## $ Saving.accounts : Factor w/ 4 levels "little","moderate",..: 1 1 1 1 1 1 3 1 4 1 ...
## $ Checking.account: Factor w/ 3 levels "little","moderate",..: 1 2 1 1 1 1 1 2 2 2 ...
## $ Credit.amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Purpose : Factor w/ 8 levels "business","car",..: 6 6 4 5 2 4 5 2 6 2 ...
## $ Risk : Factor w/ 2 levels "bad","good": 2 1 2 2 1 2 2 2 2 1 ...
Now we’re gonna do some exploratory analysis to understand our dataset.
ggplot(data = df, aes(x = Job, fill = Risk)) + geom_bar(position='fill') +
labs(y = 'Count', x = 'Fig. 1: Count of Bad and Good risk by kind of Job') +
theme(axis.text.x = element_text(angle=10, vjust=0.6)) +
scale_fill_manual(values=c('red', 'blue'))

ggplot(data = df, aes(x = Housing, fill = Risk)) + geom_bar(position='fill') +
labs(y = 'Count', x = 'Fig. 2: Count of Bad and Good risk by Housing') +
scale_fill_manual(values=c('red', 'blue'))

ggplot(data = df, aes(x = Saving.accounts, fill = Risk)) + geom_bar(position='fill') +
labs(y = 'Count', x = 'Fig. 3: Count of Bad and Good risk by Saving Accounts') +
scale_fill_manual(values=c('red', 'blue'))

We can see that ‘little’ and ‘moderate’ are very similar, as ‘quite rich’ and ‘rich’. So we can make two categories by four to simplify our model.
df$Saving.accounts = as.factor(if_else(df$Saving.accounts == 'little' | df$Saving.accounts == 'moderate',
'little/moderate', 'quite rich/rich'))
ggplot(data = df, aes(x = Saving.accounts, fill = Risk)) + geom_bar(position='fill') +
labs(y = 'Count', x = 'Fig. 3: Count of Bad and Good risk by Saving Accounts') +
scale_fill_manual(values=c('red', 'blue'))

ggplot(data = df, aes(x = Checking.account, fill = Risk)) + geom_bar(position='fill') +
labs(y = 'Count', x = 'Fig. 4: Count of Bad and Good risk by Checking Accounts') +
scale_fill_manual(values=c('red', 'blue'))

ggplot(data = df, aes(x = Purpose, fill = Risk)) + geom_bar(position=position_dodge()) +
labs(y = 'Count', x = 'Fig. 5: Count of Bad and Good risk by Purpose') +
theme(axis.text.x = element_text(angle=50, vjust=0.6)) + ylim(0, 240) +
scale_fill_manual(values=c('red', 'blue'))

bad_risk = df %>% filter(Risk == 'bad')
good_risk = df %>% filter(Risk == 'good')
ggplot(data = bad_risk, aes(x = Age)) + geom_histogram(fill = 'blue',
breaks = c(seq(15, 75, 5))) +
labs(y = 'Count', x = 'Age', caption = 'Fig. 6: Ages of the people with bad risk distribution')

ggplot(data = good_risk, aes(x = Age)) + geom_histogram(fill = 'blue',
breaks = c(seq(15, 75, 5))) +
labs(y = 'Count', x = 'Age', caption = 'Fig. 7: Ages of the people with good risk distribution')

ggplot(data = good_risk, aes(x = Age)) + geom_histogram(fill = 'blue',
breaks = c(seq(15, 75, 5))) +
labs(y = 'Count', x = 'Duration', caption = 'Fig. 8: Duration with good risk distribution')

ggplot(data = bad_risk, aes(x = Age)) + geom_histogram(fill = 'blue',
breaks = c(seq(15, 75, 5))) +
labs(y = 'Count', x = 'Duration', caption = 'Fig. 9: Duration with bad risk distribution')

ggplot(data = bad_risk, aes(x = Credit.amount)) + geom_histogram(fill = 'blue',
breaks = c(seq(250, 18424, 828))) +
labs(y = 'Credit amount', x = 'Duration', caption = 'Fig. 10: Credit amount with bad risk distribution')

ggplot(data = good_risk, aes(x = Credit.amount)) + geom_histogram(fill = 'blue', breaks = c(seq(250, 18424, 828))) +
labs(y = 'Credit amount', x = 'Duration', caption = 'Fig. 11: Credit amount with good risk distribution')

There is 700 people classified as ‘good risk’ and only 300 peole as ‘bad risk’, so it’s quite hard to compare and we will have problems to create our predictive model.
In Figure 1 we can see that the proportion of good and bad risk by kind of job are very similar, so we shouldn’t use it in our model;
In Figure 2, between free and rent categories, as the proportion as quite the same, they might not help us a lot, but own may help;
In Figure 3, as expected, the proportion of quite rich and rich as good risk is bigger than the others. So I united quite rich/rich and little/moderate;
In Figure 4, we can see different proportion for each class. It’ll help our prediction;
In Figure 5, some categories are very similar, but others are very distinct. We shall use this variable;
In Figure 6 and 7, we can see that both distribution are very similar. So this variable won’t be used;
In Figure 8 and 9, there are some difference in the distributions. So we will use it;
In Figure 10 and 11, the distributions are not very similar. So we might use this variable.
Let’s create out model!
df1 = select(df, -X ,-Age, -Job)
# Splitting train and test samples
set.seed(100)
inTrain = createDataPartition(df1$Risk, p = 0.7, list = F)
train = df[inTrain, ]
test = df[-inTrain, ]
We’ll create the same model with three different resampling methods and check what is better.
# Bootstrapp resampling method:
control = trainControl(method = 'boot', number = 25)
# Creating the model 1:
set.seed(80)
model1 = train(Risk ~ ., data = train, method = 'rpart', trControl = control)
model1
## CART
##
## 700 samples
## 10 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 700, 700, 700, 700, 700, 700, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01111111 0.6776081 0.15196227
## 0.02857143 0.6973115 0.11717655
## 0.04047619 0.7007476 0.09295091
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04047619.
# Cross-Validation resampling method:
control = trainControl(method = 'cv', number = 10)
# Creating model 2:
set.seed(80)
model2 = train(Risk ~ ., data = train, method = 'rpart', trControl = control)
model2
## CART
##
## 700 samples
## 10 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 630, 630, 630, 630, 630, 630, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01111111 0.7142857 0.19557444
## 0.02857143 0.7128571 0.13292981
## 0.04047619 0.6928571 0.05810254
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01111111.
# Repeated k-fold Cross-Validation resampling method:
control = trainControl(method = 'repeatedcv', number = 5, repeats = 10)
# Creating model 3:
set.seed(80)
model3 = train(Risk ~ ., data = train, method = 'rpart', trControl = control)
model3
## CART
##
## 700 samples
## 10 predictor
## 2 classes: 'bad', 'good'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 560, 560, 560, 560, 560, 560, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01111111 0.6947143 0.17763394
## 0.02857143 0.7082857 0.13516447
## 0.04047619 0.6982857 0.06785114
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02857143.
# Testing the model 1:
prev = predict(model1, test)
confusionMatrix(prev, test$Risk)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 0 0
## good 90 210
##
## Accuracy : 0.7
## 95% CI : (0.6447, 0.7513)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.5284
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0
## Specificity : 1.0
## Pos Pred Value : NaN
## Neg Pred Value : 0.7
## Prevalence : 0.3
## Detection Rate : 0.0
## Detection Prevalence : 0.0
## Balanced Accuracy : 0.5
##
## 'Positive' Class : bad
##
# Testing the model 2:
prev = predict(model2, test)
confusionMatrix(prev, test$Risk)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 18 17
## good 72 193
##
## Accuracy : 0.7033
## 95% CI : (0.6481, 0.7545)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.4782
##
## Kappa : 0.1442
##
## Mcnemar's Test P-Value : 1.041e-08
##
## Sensitivity : 0.2000
## Specificity : 0.9190
## Pos Pred Value : 0.5143
## Neg Pred Value : 0.7283
## Prevalence : 0.3000
## Detection Rate : 0.0600
## Detection Prevalence : 0.1167
## Balanced Accuracy : 0.5595
##
## 'Positive' Class : bad
##
# Testing the model 3:
prev = predict(model3, test)
confusionMatrix(prev, test$Risk)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 15 17
## good 75 193
##
## Accuracy : 0.6933
## 95% CI : (0.6378, 0.745)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.6264
##
## Kappa : 0.1051
##
## Mcnemar's Test P-Value : 2.804e-09
##
## Sensitivity : 0.1667
## Specificity : 0.9190
## Pos Pred Value : 0.4687
## Neg Pred Value : 0.7201
## Prevalence : 0.3000
## Detection Rate : 0.0500
## Detection Prevalence : 0.1067
## Balanced Accuracy : 0.5429
##
## 'Positive' Class : bad
##
results = tibble(Models = c('Model 1', 'Model 2', 'Model 3' ),
'Train Accuracy' = c(0.7007476, 0.7142857, 0.7082857),
'Test Accuracy' = c(0.7000, 0.7033, 0.6933))
results
## # A tibble: 3 x 3
## Models `Train Accuracy` `Test Accuracy`
## <chr> <dbl> <dbl>
## 1 Model 1 0.701 0.7
## 2 Model 2 0.714 0.703
## 3 Model 3 0.708 0.693
As we can see in the table above, the model 2 is the better model. Our model is great predicting ‘good risk’, but very bad predicting ‘bad risk’.
Obs: This is not a good dataset to create predictive models because it’s not balanced, most of the target are ‘good risk’. What would you do to improve de accuracy?