“It takes money to make more money”. As an individual or company when we want to lend money, we set some critical parameters or guidelines to understand the credit risk. In this project, our aim is to analyze good and bad credit risk associated with individuals. The purpose of this stage of the project is to build classifiers which will help in predicting whether or not an individual has good or bad credit risk. This will be based on German Credit Dataset, which was sourced from UCI Machine Learning Repository. The project is segregated into two distinct stages. Stage 1 focusses on creating classifiers which will help the machine learning process to provide prediction. This stage will be comprise of data preprocessing, exploration and visualisation. In stage 2, we will focus on model building. The report is divided into the following parts: 1. Dataset: Which talks about the data 2. Pre-Processing: Which takes one through the required steps to clean and tidy the data for exploration. 3. Exploration: Where we show the dependencies and key features of the data 4. Summary: Where we summarise the report and findings.
The variable descriptions for the dataset are given below: 1. Checking: <0, 0
The following R packages were used for this stage of the project.
library(readr)
library(ggplot2)
library(dplyr)
library(mlr)
library(tidyverse)
library(GGally)
library(cowplot)
We read the dataset and renamed the variables for convenience and ease of usage as there were 20 distinct variables in the dataset. We eventually convert the character variables to factors (categorical data).
credit <- read_csv("credit.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## duration = col_integer(),
## credit_amount = col_integer(),
## installment_commitment = col_integer(),
## residence_since = col_integer(),
## age = col_integer(),
## existing_credits = col_integer(),
## num_dependents = col_integer()
## )
## See spec(...) for full column specifications.
names(credit) <- c('Checking', 'Duration', 'CreditHistory', 'Purpose', 'CreditAmount',
'Saving', 'Employment', 'Installment', 'Status', 'OtherParties', 'Residence',
'Asset', 'Age', 'OtherPlans', 'Housing', 'ExistingCredits', 'JobType', 'Dependants',
'Phone', 'Foreign', 'CreditRisk')
We use the str and summarizeColumns functions to understand the dataset a bit further. From these we noticed the following: 1. All the character variables had a lot of white space. 2. There were no missing values in the dataset.
str(credit, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 21 variables:
## $ Checking : chr "'<0'" "'0<=X<200'" "'no checking'" "'<0'" ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ CreditHistory : chr "'critical/other existing credit'" "'existing paid'" "'critical/other existing credit'" "'existing paid'" ...
## $ Purpose : chr "radio/tv" "radio/tv" "education" "furniture/equipment" ...
## $ CreditAmount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Saving : chr "'no known savings'" "'<100'" "'<100'" "'<100'" ...
## $ Employment : chr "'>=7'" "'1<=X<4'" "'4<=X<7'" "'4<=X<7'" ...
## $ Installment : int 4 2 2 2 3 2 3 2 2 4 ...
## $ Status : chr "'male single'" "female div/dep/mar" "'male single'" "'male single'" ...
## $ OtherParties : chr "none" "none" "none" "guarantor" ...
## $ Residence : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Asset : chr "'real estate'" "'real estate'" "'real estate'" "'life insurance'" ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ OtherPlans : chr "none" "none" "none" "none" ...
## $ Housing : chr "own" "own" "own" "'for free'" ...
## $ ExistingCredits: int 2 1 1 1 2 1 1 1 1 2 ...
## $ JobType : chr "skilled" "skilled" "'unskilled resident'" "skilled" ...
## $ Dependants : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Phone : chr "yes" "none" "none" "none" ...
## $ Foreign : chr "yes" "yes" "yes" "yes" ...
## $ CreditRisk : chr "good" "bad" "good" "good" ...
summarizeColumns(credit) %>% knitr::kable( caption = 'Table 1. Feature Summary before Data Preprocessing')
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| Checking | character | 0 | NA | 0.6060000 | NA | NA | 63 | 394 | 4 |
| Duration | integer | 0 | 20.903 | 12.0588145 | 18.0 | 8.8956 | 4 | 72 | 0 |
| CreditHistory | character | 0 | NA | 0.4700000 | NA | NA | 40 | 530 | 5 |
| Purpose | character | 0 | NA | 0.7200000 | NA | NA | 9 | 280 | 10 |
| CreditAmount | integer | 0 | 3271.258 | 2822.7368760 | 2319.5 | 1627.1535 | 250 | 18424 | 0 |
| Saving | character | 0 | NA | 0.3970000 | NA | NA | 48 | 603 | 5 |
| Employment | character | 0 | NA | 0.6610000 | NA | NA | 62 | 339 | 5 |
| Installment | integer | 0 | 2.973 | 1.1187147 | 3.0 | 1.4826 | 1 | 4 | 0 |
| Status | character | 0 | NA | 0.4520000 | NA | NA | 50 | 548 | 4 |
| OtherParties | character | 0 | NA | 0.0930000 | NA | NA | 41 | 907 | 3 |
| Residence | integer | 0 | 2.845 | 1.1037179 | 3.0 | 1.4826 | 1 | 4 | 0 |
| Asset | character | 0 | NA | 0.6680000 | NA | NA | 154 | 332 | 4 |
| Age | integer | 0 | 35.546 | 11.3754686 | 33.0 | 10.3782 | 19 | 75 | 0 |
| OtherPlans | character | 0 | NA | 0.1860000 | NA | NA | 47 | 814 | 3 |
| Housing | character | 0 | NA | 0.2870000 | NA | NA | 108 | 713 | 3 |
| ExistingCredits | integer | 0 | 1.407 | 0.5776545 | 1.0 | 0.0000 | 1 | 4 | 0 |
| JobType | character | 0 | NA | 0.3700000 | NA | NA | 22 | 630 | 4 |
| Dependants | integer | 0 | 1.155 | 0.3620858 | 1.0 | 0.0000 | 1 | 2 | 0 |
| Phone | character | 0 | NA | 0.4040000 | NA | NA | 404 | 596 | 2 |
| Foreign | character | 0 | NA | 0.0370000 | NA | NA | 37 | 963 | 2 |
| CreditRisk | character | 0 | NA | 0.3000000 | NA | NA | 300 | 700 | 2 |
We next need to get rid of the excessive white space for all character variables.
credit[, sapply(credit, is.character )] <- sapply( credit[, sapply(credit, is.character )], trimws)
We next check the levels for each character feature and the following points stood out a lot. 1. There were no visible missing values. 2. There were only 37 non foreign working individuals. 3. For Purpose we merged domestic appliance, furnture/equipment, repairs, radio/tv and retraining as household and new car and used car as Cars. 4. For Status, we saw that there was only 1 class for females so we renamed it female, and we merged the 3 classes of males to Male. 5. There were 200 unskilled individuals. 6. Only 49 people (4.9%) had all paid dues. 7. The maximum number of loans were for Radio/TV or for a new car. 8. Almost 60% of the individuals did not have a phone.
sapply( credit[ sapply(credit, is.character)], table)
## $Checking
##
## '<0' '>=200' '0<=X<200' 'no checking'
## 274 63 269 394
##
## $CreditHistory
##
## 'all paid' 'critical/other existing credit'
## 49 293
## 'delayed previously' 'existing paid'
## 88 530
## 'no credits/all paid'
## 40
##
## $Purpose
##
## business domestic appliance education
## 97 12 50
## furniture/equipment new car other
## 181 234 12
## radio/tv repairs retraining
## 280 22 9
## used car
## 103
##
## $Saving
##
## '<100' '>=1000' '100<=X<500'
## 603 48 103
## '500<=X<1000' 'no known savings'
## 63 183
##
## $Employment
##
## '<1' '>=7' '1<=X<4' '4<=X<7' unemployed
## 172 253 339 174 62
##
## $Status
##
## 'male div/sep' 'male mar/wid' 'male single'
## 50 92 548
## female div/dep/mar
## 310
##
## $OtherParties
##
## 'co applicant' guarantor none
## 41 52 907
##
## $Asset
##
## 'life insurance' 'no known property' 'real estate'
## 232 154 282
## car
## 332
##
## $OtherPlans
##
## bank none stores
## 139 814 47
##
## $Housing
##
## 'for free' own rent
## 108 713 179
##
## $JobType
##
## 'high qualif/self emp/mgmt' 'unemp/unskilled non res'
## 148 22
## 'unskilled resident' skilled
## 200 630
##
## $Phone
##
## none yes
## 596 404
##
## $Foreign
##
## no yes
## 37 963
##
## $CreditRisk
##
## bad good
## 300 700
We kept the priginal features intact to ensure data does not get affected and created new descriptive features from them. Each of the newly created feature had its name end with 1 to distinct itself from the original.
credit <- credit %>%
mutate(Purpose1 = ifelse( Purpose %in% c('domestic appliance', 'furniture/equipment', 'radio/tv', 'repairs', 'retraining'), 'Household',
ifelse( Purpose %in% c('new car', 'used car'), 'Cars',
ifelse( grepl('business', Purpose), 'Business',
ifelse( grepl('education', Purpose), 'Education',
ifelse( grepl('other', Purpose), 'Other', Purpose))))),
Status1 = ifelse( Status != 'female div/dep/mar', 'Male', 'Female')
)
## Warning: package 'bindrcpp' was built under R version 3.4.4
We next changed all the character variables to factors.
credit[, sapply( credit, is.character )] <- lapply( credit[, sapply( credit, is.character )], factor)
Table 2 represents the summary statistics of the data after preprocesseing.
summarizeColumns(credit) %>% knitr::kable( caption = 'Table 2. Feature Summary after Data Preprocessing' )
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| Checking | factor | 0 | NA | 0.6060000 | NA | NA | 63 | 394 | 4 |
| Duration | integer | 0 | 20.903 | 12.0588145 | 18.0 | 8.8956 | 4 | 72 | 0 |
| CreditHistory | factor | 0 | NA | 0.4700000 | NA | NA | 40 | 530 | 5 |
| Purpose | factor | 0 | NA | 0.7200000 | NA | NA | 9 | 280 | 10 |
| CreditAmount | integer | 0 | 3271.258 | 2822.7368760 | 2319.5 | 1627.1535 | 250 | 18424 | 0 |
| Saving | factor | 0 | NA | 0.3970000 | NA | NA | 48 | 603 | 5 |
| Employment | factor | 0 | NA | 0.6610000 | NA | NA | 62 | 339 | 5 |
| Installment | integer | 0 | 2.973 | 1.1187147 | 3.0 | 1.4826 | 1 | 4 | 0 |
| Status | factor | 0 | NA | 0.4520000 | NA | NA | 50 | 548 | 4 |
| OtherParties | factor | 0 | NA | 0.0930000 | NA | NA | 41 | 907 | 3 |
| Residence | integer | 0 | 2.845 | 1.1037179 | 3.0 | 1.4826 | 1 | 4 | 0 |
| Asset | factor | 0 | NA | 0.6680000 | NA | NA | 154 | 332 | 4 |
| Age | integer | 0 | 35.546 | 11.3754686 | 33.0 | 10.3782 | 19 | 75 | 0 |
| OtherPlans | factor | 0 | NA | 0.1860000 | NA | NA | 47 | 814 | 3 |
| Housing | factor | 0 | NA | 0.2870000 | NA | NA | 108 | 713 | 3 |
| ExistingCredits | integer | 0 | 1.407 | 0.5776545 | 1.0 | 0.0000 | 1 | 4 | 0 |
| JobType | factor | 0 | NA | 0.3700000 | NA | NA | 22 | 630 | 4 |
| Dependants | integer | 0 | 1.155 | 0.3620858 | 1.0 | 0.0000 | 1 | 2 | 0 |
| Phone | factor | 0 | NA | 0.4040000 | NA | NA | 404 | 596 | 2 |
| Foreign | factor | 0 | NA | 0.0370000 | NA | NA | 37 | 963 | 2 |
| CreditRisk | factor | 0 | NA | 0.3000000 | NA | NA | 300 | 700 | 2 |
| Purpose1 | factor | 0 | NA | 0.4960000 | NA | NA | 12 | 504 | 5 |
| Status1 | factor | 0 | NA | 0.3100000 | NA | NA | 310 | 690 | 2 |
Age
p1 <- ggplot(credit, aes(x = Age)) + geom_histogram(bins = 35) + labs(title = 'Histogram of Ages')
p2 <- ggplot(credit, aes(x = Age, fill = CreditRisk)) +
geom_histogram(bins = 35) + facet_grid(~CreditRisk) +
labs(title = 'Histogram of Age by Credit Risk Classes')
plot_grid(p1, p2, ncol = 1)
Most of individuals who would want to avail a credit facility are aged between 20 and 40. We see a normal distribution for credit risk. Individuals in this age would have commenced their carrier and hence would be able to take calculated risks.
Duration (in months)
p3 <- ggplot(credit, aes(x = Duration)) + geom_histogram(bins = 20) +
labs(title = 'Histogram of Duration of Credit provided')
p4 <- ggplot(credit, aes(x = Duration, fill = CreditRisk)) +
geom_histogram(bins = 20) + facet_grid(~CreditRisk) +
labs(title = 'Histogram of Duration by Credit Risk Classes')
plot_grid(p3, p4, ncol = 1)
Duration here refers to the tenure or length of period an individual agrees to pay out the credit facility. Durations appears to be right skewed with ideal tenure in the range of 12 to 24 months.
Credit Amount (in Deutsch Mark)
p5 <- ggplot(credit, aes(x = CreditAmount)) + geom_histogram(bins = 20) +
labs(title = 'Histogram of Credit Amount z provided')
p6 <- ggplot(credit, aes(x = CreditAmount, fill = CreditRisk)) +
geom_histogram(bins = 20) + facet_grid(~CreditRisk) +
labs(title = 'Histogram of Credit Amount provided by Credit Risk Classes')
plot_grid(p5, p6, ncol = 1)
Credit amount is in Deutsch Mark. On aggregated level, individuals took a loan amount of approximately 1000 to 2000 marks, as indicated by the sharp kurtosis in the histogram of credit amount.
Installment (Rate in % of disposable income)
p7 <- ggplot(credit, aes(x = Installment)) + geom_histogram(bins = 4) +
labs(title = 'Histogram of Installment rate in % of disposable Income')
p8 <- ggplot(credit, aes(x = Installment, fill = CreditRisk)) +
geom_histogram(bins = 4) + facet_grid(~CreditRisk) +
labs(title = 'Histogram of Installment Rate by Credit Risk Classes')
plot_grid(p7, p8, ncol = 1)
We are referring to the interest rate one would pay to avail the credit facility. Usually a lower instalment rate means, less interest payable. From the histogram, we can see that only good credit risk can avail a lower risk rate of 1-2 percent. Bad credit risk is right skewed as the majority instalment rate is aggregated to 4 percent.
Present Residence
p9 <- ggplot(credit, aes(x = Residence)) + geom_histogram(bins = 4) +
labs(title = 'Histogram of Present Residence in years')
p10 <- ggplot(credit, aes(x = Residence, fill = CreditRisk)) +
geom_histogram(bins = 4) + facet_grid(~CreditRisk) +
labs(title = 'Histogram of Present Residence by Credit Risk Classes')
plot_grid(p9, p10, ncol = 1)
Number of years an individual resides in the current address, determines their residential status and shows stability of individual movement. For credit risk a long stay in the current address determines a lower risk and better risk-taking ability. Hence the histogram is right skewed for good credit risk.
Existing Credits
p11 <- ggplot(credit, aes(x = ExistingCredits)) + geom_histogram(bins = 4) +
labs(title = 'Histogram of number of Existing Credits in the bank')
p12 <- ggplot(credit, aes(x = ExistingCredits, fill = CreditRisk)) +
geom_histogram(bins = 4) + facet_grid(~CreditRisk) +
labs(title = 'Histogram of Existing Credits by Credit Risk Classes')
plot_grid(p11, p12, ncol = 1)
Number of existing credit facilities at a bank plays an important role in determining credit risk. Existing debt also means the capability to take more debt decreases. Hence, number of existing credit facilities limited to 1-2. Higher number of existing credit are more likely to be good credit risk.
Number of Dependants
p13 <- ggplot(credit, aes(x = Dependants)) + geom_histogram(bins = 2) +
labs(title = 'Histogram of number of dependants of the individual')
p14 <- ggplot(credit, aes(x = Dependants, fill = CreditRisk)) +
geom_histogram(bins = 2) + facet_grid(~CreditRisk) +
labs(title = 'Histogram of number of dependants by Credit Risk Classes')
plot_grid(p13, p14, ncol = 1)
This factor establishes living cost of an individual. People with higher number dependents to have higher living cost when compared to no dependents. As the histogram is right skewed, it suggests that, individual with no dependents or less dependents are more likely to borrow more credit.
Status of checking account in bank
p15 <- ggplot(credit, aes(x = Checking)) + geom_bar() +
labs(title = 'Bar Chart of Status of checking account') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p16 <- ggplot(credit, aes(x = Checking, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p15, p16, ncol = 2)
Number of people not having a checking account with the bank was considerably high. Individuals with checking account facility of less than 200 were more likely to avail credit facilities and Individuals with checking account facility of less than 0 were likely to be grouped in bad credit risk
Credit History
p17 <- ggplot(credit, aes(x = CreditHistory)) + geom_bar() +
labs(title = 'Bar Chart of Credit History') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p18 <- ggplot(credit, aes(x = CreditHistory, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p17, p18, ncol = 2)
When making lending decisions, lenders review your credit history to determine how likely an individual would repay your loan on time. A longer history shows you have more experience using credit and hence lenders can be more accurate in determining the level of risk they take on when lending to you.
Purpose
p19 <- ggplot(credit, aes(x = Purpose1)) + geom_bar() +
labs(title = 'Bar Chart of Purpose of the loan') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p20 <- ggplot(credit, aes(x = Purpose1, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p19, p20, ncol = 2)
Loan purpose has been divides into four categories. Majority of people usually take a credit facility to purchase a car or make house hold improvements.
Status of Savings Account
p21 <- ggplot(credit, aes(x = Saving)) + geom_bar() +
labs(title = 'Bar Chart of Status of Savings Account') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p22 <- ggplot(credit, aes(x = Saving, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p21, p22, ncol = 2)
Savings or cash held in once savings account are divided into four categories. Savings of less than 100 thousand marks are likely to borrow credit facility. Credit risk remains good when individual have more than 100 thousand marks in savings.
Employment
p23 <- ggplot(credit, aes(x = Employment)) + geom_bar() +
labs(title = 'Bar Chart of Employment Tenure') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p24 <- ggplot(credit, aes(x = Employment, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p23, p24, ncol = 2)
Segregating employment tenure by credit risk showed that good credit risk individuals were more likely to have higher employment tenure whereas bed credit risk individuals were mostly lower employment tenure.
Sex of Individual
p25 <- ggplot(credit, aes(x = Status1)) + geom_bar() +
labs(title = 'Bar Chart of Sex of Individual', x = "Gender") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p26 <- ggplot(credit, aes(x = Status1, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '', x = "Gender") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p25, p26, ncol = 2)
Males dominated the overall credit risk taking ability. However, the bad credit risk was proportionality more for the females.
Other Parties Involved
p27 <- ggplot(credit, aes(x = OtherParties)) + geom_bar() +
labs(title = 'Bar Chart of Other Parties Involved') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p28 <- ggplot(credit, aes(x = OtherParties, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p27, p28, ncol = 2)
In total three parties were involved – co applicant, guarantor and none. None stood out among the three, as it also included individuals who had applied for credit facility alone.
Asset Type
p29 <- ggplot(credit, aes(x = Asset)) + geom_bar() +
labs(title = 'Bar Chart of Asset Type') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p30 <- ggplot(credit, aes(x = Asset, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p29, p30, ncol = 2)
Individuals are more likely to be grouped in good credit risk if one has listed a real estate or car as an asset. On the other hand, people who do not own a real estate are more likely to default or fall in bad credit risk. One of the reason behind this could be that, people who own a piece of real estate property are more likely to avail credit facility at a lower interest, as it would fall under secured lending.
Other Plans with the bank
p31 <- ggplot(credit, aes(x = OtherPlans)) + geom_bar() +
labs(title = 'Bar Chart of Other Plans with the bank') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p32 <- ggplot(credit, aes(x = OtherPlans, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p31, p32, ncol = 2)
Individuals who do not have any other commitments towards other bank, stores or other facility would have more appetite for credit facility, as their undistributed income would be higher.
Housing
p33 <- ggplot(credit, aes(x = Housing)) + geom_bar() +
labs(title = 'Bar Chart of Housing status') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p34 <- ggplot(credit, aes(x = Housing, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p33, p34, ncol = 2)
Most individuals own a house. Possibility to falling into bad credit risk group increases when living arrangement status is rent. People living for free do not have any form of commitment in terms of living expense and hence possibility of falling into bad credit risk is minimal.
Job Type
p35 <- ggplot(credit, aes(x = JobType)) + geom_bar() +
labs(title = 'Bar Chart of Job Type') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p36 <- ggplot(credit, aes(x = JobType, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p35, p36, ncol = 2)
If unemployed, possibility of getting credit facility remains minimal. Skilled job type individual availed the maximum credit risk facility. Possibility of an unskilled job type to fall into bad credit risk is proportionality higher.
Phone
p37 <- ggplot(credit, aes(x = Phone)) + geom_bar() +
labs(title = 'Bar Chart of Phone') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p38 <- ggplot(credit, aes(x = Phone, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p37, p38, ncol = 2)
Most individuals do not have a phone connection. However, this factor did not play a significant role in availing credit faculty.
Foreign Worker
p39 <- ggplot(credit, aes(x = Foreign)) + geom_bar() +
labs(title = 'Bar Chart of Foreign Worker') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p40 <- ggplot(credit, aes(x = Foreign, fill = CreditRisk)) +
geom_bar() + facet_grid(CreditRisk~.) +
labs(title = '') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plot_grid(p39, p40, ncol = 2)
Nationality played a significant role in availing credit facility. Most individuals who availed credit facility where foreigners.
Job Type, Sex and Asset The following visual shows that most of the females had bad risk, and males had good risk. Unemplyed females however had good risk, whereas unemplyed males had bad risk. Most of the females had life insurance as an asset wheras males had real estate. The credit risk based on job type differed a lot between males and females. Particular aspects favoured females whereas others favoured males.
Age, Sex and Employment Years The following visualisation shows that the highest concentration of males varied from 20 to 55, wheras for females it varied between 20 and 45. There were also more males who were emplyed for more than 1 years. A lot of males had been employed for more than 7 years, as compared to females. Most of the females had less than 4 years of employment.
Housing, Residence and Purpose The following visualisation shows that most of the individuals had their own housing, out of which maximum had good credit. Lots of individuals who were staying for free for 4 years had bad credit risk. Credit for household expenses was the most sought after credit scheme for individuals with their own house, wheras people who lived for free had credit for cars. People on rent had similar numbers in credit for purpose of both cars and household.
Credit History, Sex and Number of dependants From the visualisation, we can see that most of the females had only 1 dependant, whereas males had slightly more in proportion. Males with 1 or 2 dependant mainly either had existing paid credit or critical credit.
ggplot(credit, aes(x = CreditHistory, fill = CreditRisk)) +
geom_bar(position = 'dodge') + facet_grid( Status1~ Dependants ) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = 'Bar Chart: Credit History, Sex and Number of dependants', x = "Credit History", fill = "Credit Risk")
From the data exploration phase, we found that the numerical features were quite tidy, and hence no cleaning needed to be undertaken. In case of the categorical variables, we had to define some new features which binned their corresponding original features into lower cardinalities. We did not remove any of the original variables in the dataset, as we would like to play around with granularity at the model building phase.
From the above exploration and visualisation in the Data Exploration step, we found that Status1 (Gender), Purpose1 (Purpose of credit), Housing (Current Accomodation type), Age, Employment (In years), JobType and Asset (eg. property, insurance) can possiblly be useful predictors for the CreditRisk class.