1 Abstract

Currently, many machine learning models have been deployed to predict whether a person will default their loan amount or they will pay it back. These decisions are made using predictive modeling and ML by using various factors, such as their education level, sex, personal status, checking the amount, number of savings bonds, and many more. This project aims to use one of such publicly available datasets, the Statlog - German Credit Risk Dataset, which has anonymized data of many bank customers, with their details and whether they had defaulted their loan amount or they were good customers and paid back their loan amount. Within this project, I aim to first pre-process the data into both user’s readable and machine readable format, explore the data and derive inferences, and finally use this to predict whether if a person will default their loan or not.

Keywords - German-Credit-Risk, Machine-Learning, Predictive Modelling

2 Introduction

Most banks’ primary source of income is from providing loans for their customers. They store people’s money and pay them some interest on that money, and to some other customers, provide a loan for a purpose at a higher interest than before. This margin between the saving interest and the loan interest is where banks make most of their money.

However, every time a bank provides a loan, it faces a risk of the loan not being paid back. Generally, banks take some collateral, such as a person’s property. However, most banks would want even to avoid providing a person who will default their loan since they are losing money and time value of money. In that time, they could have invested in a loan to a person who would pay their loan.

Therefore, it is crucial to determine whether a person is a defaulter or someone who will pay back their loan before the bank even provides the loan. In In this project, I pre-processed the data, then plotted graphs using the powerful. R Programming Language and the plotly package. Using these graphs, I have also derived inferences from these plots and finally used the data to build a machine a learning model that predicts whether if a person will be a defaulter or not.

I also wish to make a Shiny web application that takes all of the required data and predicts whether if a person can be provided with a loan or not.

PROJECT: https://github.com/suryasashankgundepudi/german-credit-risk-modelling

SHINY WEB APP: YET TO BE DEVELOPED

2.1 About the Various Aspects of the project

Packages Required

RCurl
dplyr
tidyr
descr
reshape2
ggplot2
plotly
CatEncoders
stringr
superml

Inspiration and Acknowledgements

I would like to thank the Kaggle Community for helping me review the code and also bringing many ideas for data visualizing techniques. I have taken inspiration from various places and have tried to implement this project in R.
I was incentivized to take up this project since I had just learnt R Programming and wanted to learn various data visualization techniques.
I would also like to thank the Stackoverflow Community for helping me with some of the most simple yet crucial problems.

3 Dataset Description

This dataset was provided by Dr. Hand Hoffmann from the University of Hamburg (Universit"at Hamburg). It is publicly available for data scientists to use at the UCI MACHINE LEARNING REPOSITOY. The direct link to the dataset, with both numeric and the true data, is at - STATLOG-GERMAN-CREDIT.

The data contains anonymized data of 1000 customers who have either defaulted their bank loan or have paid back their credits duly. It contains 20 attributes, 7 of which are numerical and 13 of which are categorical. These attributes contain relevant information about the customer. They have been listed below:

Status of existing checking account (categorical)
Duration in month (numerical)
Credit history (categorical)
Purpose (categorical)
Credit amount (numerical)
Savings account/bonds (categorical)
Present employment since (categorical)
Installment rate in percentage of disposable income (numerical)
Personal status (categorical)
Sex (categorical)
Other debtors / guarantors (numerical)
Present residence since (categorical)
Property (categorical)
Age in years (numerical)
Other installment plans (categorical)
Housing (categorical)
Number of existing credits at this bank (numerical)
Number of people being liable to provide maintenance for (numerical)
Job (categorical)
Telephone (numerical)
foreign worker (categorical)

The target variable is the outcome or risk taken by the bank. It contains 1 if the risk taken was good and the person was not a defaulter and 0 if the person was a defaulter.

4 Data Pre-Processing

Within this section I will cover some of the basic data-preprocessing techniques I had employed to get to a more understandable and descriptive data.

4.1 Reading the data.

The data was first read from the UCI- machine learning repository using the following chunck of code. The required package for this chunk is RCurl

i saved this data into a new directory for further processing.

The table below shows how the data looks without any kind of pre-processing

The Raw Data.
V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16	V17	V18	V19	V20	V21
A11	6	A34	A43	1169	A65	A75	4	A93	A101	4	A121	67	A143	A152	2	A173	1	A192	A201	1
A12	48	A32	A43	5951	A61	A73	2	A92	A101	2	A121	22	A143	A152	1	A173	1	A191	A201	2
A14	12	A34	A46	2096	A61	A74	2	A93	A101	3	A121	49	A143	A152	1	A172	2	A191	A201	1
A11	42	A32	A42	7882	A61	A74	2	A93	A103	4	A122	45	A143	A153	1	A173	2	A191	A201	1
A11	24	A33	A40	4870	A61	A73	3	A93	A101	4	A124	53	A143	A153	2	A173	2	A191	A201	2

4.2 Mutating the data

As you can see, the data does not have any column names and the data by itself does not have names but is instead in the form of various categorical columns.

Since the data by itself cannot be used for any exploratory data analysis, I had used a switch case type of code to provide names for each data point in the data. For my reference, I had used the German.doc provided at the UCI machine learning repository. The file contained a detailed description of each attribute and what each value meant. For getting a better understanding of the rudimentary yet robust code I had employed, please visit the project page at - German-credit-Risk-Repo,

Now after cleaning the data for exploratory data analysis I was able to get to a more descriptive data:

Clean Data
Checking.Account	Duration	Credit.History	Purpose	Credit.Amount	Savings.Account.Bonds	Present.employee
< 0	6	critical account/other credits existing (not at this bank)	Radio or Television	1169	Unkown/No Savings Account	Exp >= 7
0 <= Checking < 200	48	existing credits paid back duly till now	Radio or Television	5951	Less than 100	1 <= Exp < 4
No Checking account	12	critical account/other credits existing (not at this bank)	Education	2096	Less than 100	4 <= Exp <7
< 0	42	existing credits paid back duly till now	Furniture/Equipment	7882	Less than 100	4 <= Exp <7
< 0	24	delay in paying off in the past	New Car	4870	Less than 100	1 <= Exp < 4

However, I also had to make sure the data was in a machine readable format for the later part of this project. The preprocessing was completed using the superML package in R. The following R-Code chunk helps us convert this character-based data frame into numeric data for predictive modeling.

# Reading the clean data. 
data <- read.csv("data/eda-german-credit.csv")

# Defining a new variable which takes col names of qualitative columns
catColumns <- c("Checking.Account", "Credit.History", "Purpose", 
                "Savings.Account.Bonds", "Present.employee", 
                "Other.Debters", "Property", "Other.Installment.plans", 
                "Housing", "Job", "Telephone", "Foreign.Worker", "Outcome", 
                "Sex", "Personal.Status")


tf_data <- data.frame(data)

# Transforming each qualitative column into numerical labels
for (column in catColumns){
  label <- LabelEncoder$new()
  tf_data[, column] <- label$fit_transform(tf_data[, column])
}

# Saving the data into a new file for later use
write.csv(tf_data, "data/machine-ready-credit.csv", row.names = FALSE)

The data after being converted for machine readable format looked like this. As you would’ve expected the data had no character variables.

Machine Readable Data
Checking.Account	Duration	Credit.History	Purpose	Credit.Amount	Savings.Account.Bonds	Present.employee
0	6	0	0	1169	0	0
1	48	1	0	5951	1	1
2	12	0	1	2096	1	2
0	42	1	2	7882	1	2
0	24	2	3	4870	1	1

5 Exploratory Data Analysis

Since the data is now more descriptive, I attempted to plot various graphs, most of which are interactive to derive inferences. There are significant parts of this data analysis module.

Gender Analysis
Age Analysis
Wealth and Job Analysis

Each of these categories aims better to understand the distribution of the data across various demographics. There are also some miscellaneous plots I have included, which I thought would help me derive other inferences.

I initially wanted to understand the target variable’s (whether if the loan provided was a good decision or a bad one) distribution. The bar graph shown below lets us understand it better.

From this graph, we understand that there is a class imbalance in our target variable. Though the number of people who defaulted their loan is lesser than the customers who paid back their credits duly it is still a high ratio, and it is our aim to reduce the number of defaulted loan decisions.

5.1 Gender Analysis

To get an idea of how the population of our dataset was distributed, I plotted a histogram that shows the distribution of the age group across the two genders and as a whole.

data_age = data[, "Age.in.Years"]
age_male = data[data$Sex == "Male ", "Age.in.Years"]
age_female = data[data$Sex == "Female ", "Age.in.Years"]
trace0 <- plot_ly(
  x = age_male, 
  type = "histogram", 
  histnorm='probability',
  name="Male Age",
  marker = list(
    color = 'rgba(100, 149, 247, 0.7)'
  )
)
trace0 <- trace0 %>% 
  layout(bargap = 0.05)

trace1 <- plot_ly(
  x = age_female, 
  type = "histogram", 
  histnorm='probability',
  name="Female Age",
  marker = list(
    color = 'rgba(255, 105, 180, 0.7)'
  )
)
trace1 <- trace1 %>% 
  layout(bargap = 0.05)

trace2 <- plot_ly(
  x = data_age, 
  type = "histogram", 
  histnorm='probability',
  name="Distribution",
  marker = list(
    color = 'rgba(195, 177, 225, 0.7)'
  )
)
trace2 <- trace2 %>% 
  layout(bargap = 0.05)

sub1 <- subplot(trace0, trace1)
fig1 <- subplot(sub1, trace2, shareX = TRUE, nrows = 2) 

fig1 <- fig1 %>%   layout(
  title = 'Age Distribution', 
  font=t, 
  plot_bgcolor = "#e5ecf6"
)

It is understood that the age group of people who wish to take a loan are in their 20s and 30s. This is irrespective of gender which can be seen in the overall distribution.

In the next plot I plot the reasons why men and women take a loan. To visualize this I have plotted a horizontal grouped bar graph that shows the distribution of men and women across various purposes.

purpose_list <- c(" New Car", "Business", "Domestic Appliances", "Education",         
                  "Furniture/Equipment", "Others", "Radio or Television",
                  "Repairs", "Retraining", "Used Car")

purpose_male <- unname(prop.table(table(data[data$Sex == "Male ", "Purpose"])))
purpose_female <- unname(prop.table(table(data[data$Sex == "Female ", "Purpose"])))


# VARIATION OF GENDER WITH PURPOSE 

trace0 <- plot_ly(
  x = purpose_male,
  y = purpose_list,
  name = "Male Purpose",
  type = "bar", 
  marker = list(color = 'rgba(100, 149, 247, 0.7)',
                line = list(color = 'rgb(8,48,107)',
                            width = 1.5))
)

fig2 <- trace0 %>%  add_trace(
  x = purpose_female,
  y = purpose_list,
  name = "Female Purpose",
  type = "bar", 
  marker = list(color = 'rgba(255, 105, 180, 0.7)',
                line = list(color = 'rgb(8,48,107)',
                            width = 1.5))
)
fig2 <- fig2 %>% layout(
  yaxis = list(title = 'Reason'), 
  xaxis = list(title = "Percentage of Gender"), 
  barmode = 'group'
)

The graph is plotted by taking the percentage of the number of men and women for different purposes, and then plotting them side by side. From this, it is inferred that in general, for all categories other than furniture and domestic appliances, there is a higher percentage of men who take a loan than women.

In the following graph, I plot the distribution of the credit amount that men and women have in their bank accounts. The x-axis plots the amount of money in Deutsche Mark and the y axis plot the count of the same.

trace3 <- ggplot(data[data$Sex == "Male ", ], aes(x =Credit.Amount)) + 
  geom_histogram(alpha=0.5, position="identity",  bins = 50, 
                 color="blue", fill = "lightblue") + 
  geom_vline(aes(xintercept=mean(data[data$Sex == "Male ", "Credit.Amount"])), 
             color="black", linetype="dashed", size=0.2)


trace4 <- ggplot(data[data$Sex == "Female ", ], aes(x =Credit.Amount)) + 
  geom_histogram(alpha=0.5, position="identity",  bins = 50,
                 color="red", fill = "pink") +  
  geom_vline(aes(xintercept=mean(data[data$Sex == "Female ", "Credit.Amount"])), 
             color="black", linetype="dashed", size=0.3)


fig3 <- subplot(trace3, trace4, nrows = 2, shareX = TRUE)

It could be hypothesized that the gender of a person does not affect their credit amount and that majority of the population has a credit amount between 1000 DMK and 2500 DMK.

Finally, for our gender analysis I have attempted to see if gender affects a person’s loan outcome. The next plot shows the count of men and women who have good and bad outcomes respectively.

###############################################################################
#                                                                             #  
#                    DISTTRIBUTION OF GENDER VS RISK                        #
#                                                                             #
###############################################################################

trace5 <- plot_ly(data = data, 
                x = names(table(data[data$Outcome == "Good", "Sex"])), 
                y = table(data[data$Outcome == "Good", "Sex"]), 
                type = 'bar', 
                name = 'Good Credit')

fig5 <- trace5 %>% 
  add_trace(y = table(data[data$Outcome == "Bad", "Sex"]), 
            name = 'Bad Credit')

fig4 <- fig5 %>% 
  layout(xaxis = list(title = 'Reason'),
         yaxis = list(title = 'Count'),
         barmode = 'group')

It could be understood from the above plot that men, in general, have a higher ratio of good to bad outcomes than women. However, the data might not entirely represent the general population as there is an imbalance between men and women.

Summary of Gender Analysis

The age distribution of men and women is extremely similar
Gender does not affect the credit amount
Males tend to have lower percentage of bad risk than women.

5.2 Age Analysis

I attempted to see if the various age groups have better or lesser risk in the age analysis module. I also try to look at the credit amount distribution, but I do not look at outliers in this analysis.

The population has been split into 4 mmajor age groups equaly.

Young
Young Adult
Adult
Senior

For our first plot I plotted a stacked histogram with age distribution for people with good and bad credit.

# Function takes data, feature to plot the distribution of, and separating label
plot_multi_histogram <- function(df, feature, label_column) {
  
  plt <- ggplot(df, aes(x=eval(parse(text=feature)), 
                        fill=eval(parse(text=label_column)))) +
    geom_histogram(alpha=0.5, position="identity",  
                   aes(y = ..density..), color="black") +
    geom_density(alpha=0.1) +
    geom_vline(aes(xintercept=mean(eval(parse(text=feature)))), 
               color="black", linetype="dashed", size=1) +
    labs(x=feature, 
         y = "Density") 
  
  plt + guides(fill=guide_legend(title=label_column))
}


fig1 <- plot_multi_histogram(data, "Age.in.Years", "Outcome")

It can be inferred from this graph that most of the younger population are the people with a bad risk. However, the graph is also right-skewed for good credit.

However, the age group of people with good credit lie in their late 20s and early 30s.

The following graph is a box plot of various age groups against the credit amount. This way, I will see if different age groups are more or less affluent than the other groups.

#Let's look the Credit Amount column
interval = c(18, 25, 35, 60, 120)

cats = c('Young', 'Young Adult', 'Adult', 'Senior')
data["Age_Group"] = cut(data$Age.in.Years, interval, labels=cats)

data_good = data.frame(data[data["Outcome"] == 'Good', ])
data_bad = data.frame(data[data["Outcome"] == 'Bad', ])



fig2 <- plot_ly(
  y = data_good$Credit.Amount, 
  x = data_good$Age_Group, 
  name="Good credit",
  color = '#3D9970', 
  type = "box"
)

fig2 <- fig2 %>%
  add_trace(
    y = data_bad$Credit.Amount, 
    x = data_bad$Age_Group, 
    name="Bad credit", 
    color = "Blue", 
    type = "box"
  )


fig2 <- fig2 %>%
  layout(
    yaxis=list(
      title='Credit Amount (US Dollar)',
      zeroline=F
    ),
    xaxis=list(
      title='Age Categorical'
    ),
    boxmode='group'
  )

Young adults and adults have a higher credit than other age groups. The graph also shows that, in general, people with lesser credit amounts have bad outcomes.

Another representation of the same is a violin plot as shown below. The violin plot also provides us with similar inferences as the box plot.

fig3 <- data %>%
  plot_ly(type = 'violin') 
fig3 <- fig3 %>%
  add_trace(
    x = data_good$Age_Group,
    y = data_good$Credit.Amount,
    legendgroup = 'Good Credit',
    scalegroup = 'Good Credit',
    name = 'Good Credit',
    side = 'negative',
    box = list(
      visible = T
    ),
    meanline = list(
      visible = T
    ),
    color = I("blue")
  ) 
fig3 <- fig3 %>%
  add_trace(
    x = data_bad$Age_Group,
    y = data_bad$Credit.Amount,
    legendgroup = 'Bad Credit',
    scalegroup = 'Bad Credit',
    name = 'Bad Credit',
    side = 'positive',
    box = list(
      visible = T
    ),
    meanline = list(
      visible = T
    ),
    color = I("red")
  ) 

fig3 <- fig3 %>%
  layout(
    xaxis = list(title = "Age Categories"),
    yaxis = list(
      title = "Credit Amount",
      zeroline = F
    )
  )

Finally I plot a stacked bar graph against good and bad loans for different age groups to understand the ratio of good and bad credits.

young_good = sum(na.omit(data_good$Credit.Amount[data$Age_Group == "Young"]))
young_bad = sum(na.omit(data_bad$Credit.Amount[data$Age_Group == "Young"]))

young_adult_good = sum(na.omit(data_good$Credit.Amount[data$Age_Group == "Young Adult"]))
young_adult_bad = sum(na.omit(data_bad$Credit.Amount[data$Age_Group == "Young Adult"]))

adult_good = sum(na.omit(data_good$Credit.Amount[data$Age_Group == "Adult"]))
adult_bad = sum(na.omit(data_bad$Credit.Amount[data$Age_Group == "Adult"]))

elder_good = sum(na.omit(data_good$Credit.Amount[data$Age_Group == "Senior"]))
elder_bad = sum(na.omit(data_bad$Credit.Amount[data$Age_Group == "Senior"]))

young_good_p = young_good/(young_good + young_bad) * 100
young_bad_p = young_bad/(young_good + young_bad) * 100
young_adult_good_p = young_adult_good/(young_adult_good + young_adult_bad) * 100
young_adult_bad_p = young_adult_bad/(young_adult_good + young_adult_bad) * 100
adult_good_p = adult_good/(adult_good + adult_bad) * 100
adult_bad_p =  adult_bad/(adult_good + adult_bad) * 100
elder_good_p = elder_good/(elder_good + elder_bad) * 100
elder_bad_p = elder_bad/(elder_good + elder_bad) * 100


young_good_p = round(young_adult_bad_p, 3)
young_bad_p = round(young_bad_p, 3)
young_adult_good_p = round(young_adult_good_p, 3)
young_adult_bad_p = round(young_adult_bad_p, 3)
adult_good_p = round(adult_good_p, 3)
adult_bad_p =  round(adult_bad_p, 3)
elder_good_p = round(elder_good_p, 3)
elder_bad_p = round(elder_bad_p, 3)


good_text <- c(paste(young_good_p, '%'), paste(young_adult_good_p, '%'), 
                  paste(adult_good_p, '%'), paste(elder_good_p, '%'))

bad_text <- c(paste(young_bad_p, '%'), paste(young_adult_bad_p, '%'), 
                 paste(adult_bad_p, '%'), paste(elder_bad_p, '%'))


good_loans <- plot_ly(
  x=cats,
  y=c(young_good, young_adult_good, adult_good, elder_good),
  type = "bar", 
  name="Good Loans",
  text=good_text,
  textposition = 'auto',
  marker=list(
    color='rgb(111, 235, 146)',
    line=list(
      color='rgb(60, 199, 100)',
      width=1.5)
  ),
  opacity=0.6
)

overall_loan <- good_loans %>%  add_trace(
  x=cats,
  y = c(young_bad, young_adult_bad, adult_bad, elder_bad), 
  text=bad_text,
  type = "bar",
  name="Bad Loans",
  textposition = 'auto',
  marker=list(
    color='rgb(247, 98, 98)',
    line=list(
      color='rgb(225, 56, 56)',
      width=1.5)
  ),
  opacity=0.6
)


overall_loan <- overall_loan %>% layout(
  title="Type of Loan by Age Group", 
  xaxis = list(title="Age Group"),
  yaxis= list(title="Credit Amount")
)

From this graph, it is understood that young adults have the highest ratio of good to bad risk outcomes. On the other hand, seniors surprisingly have the lowest ratio of good to bad risk outcomes.

Summary of Age Analysis

Young adults tend make up most of the bad credit population
People in their mid life (Adults) make up most of good credit population
Adults have the highest ratio of Good to Bad credit
Seniors have the lowest ratio of Good to Bad credit.

5.3 Wealth and Job Analysis

5.3.1 Wealth Analysis

Within this section I try to understand how people from different wealth classes are distributed in our data-set.

trace5 <- plot_ly(data = data, 
                  x = names(table(data[data$Outcome == "Good", "Savings.Account.Bonds"])), 
                  y = table(data[data$Outcome == "Good", "Savings.Account.Bonds"]), 
                  type = 'bar', 
                  name = 'Good Credit')

fig6 <- trace5 %>% 
  add_trace(y = table(data[data$Outcome == "Bad", "Savings.Account.Bonds"]), 
            name = 'Bad Credit')

fig6 <- fig6 %>% 
  layout(xaxis = list(title = 'Number of Bonds'),
         yaxis = list(title = 'Count'),
         barmode = 'group')

Surprisingly people from the higher class, i.e., with more savings, have a lower ratio of good to bad outcomes. Also, people from the lower savings sector have a higher ratio of good to bad credit. However, the highest class of people have the highest good to bad credit ratio.

A similar distribution can be seen based on people of different types of credit payment.

trace5 <- plot_ly(data = data, 
                  x = names(table(data[data$Outcome == "Good", "Credit.History"])), 
                  y = table(data[data$Outcome == "Good", "Credit.History"]), 
                  type = 'bar', 
                  name = 'Good Credit')

fig7 <- trace5 %>% 
  add_trace(y = table(data[data$Outcome == "Bad", "Credit.History"]), 
            name = 'Bad Credit')

fig7 <- fig7 %>% 
  layout(barmode = 'group')

5.3.2 Job Analysis

For the data provided, the job attribute is split into different levels of skill and industry. In this analysis, I plotted two different plots. One with the different types of jobs and their credit amount.

fig5 <- data %>%
  plot_ly(type = 'violin') 
fig5 <- fig5 %>%
  add_trace(
    x = data_good$Job,
    y = data_good$Credit.Amount,
    legendgroup = 'Good Credit',
    scalegroup = 'Good Credit',
    name = 'Good Credit',
    side = 'negative',
    box = list(
      visible = T
    ),
    meanline = list(
      visible = T
    ),
    color = I("blue")
  ) 
fig5 <- fig5 %>%
  add_trace(
    x = data_bad$Job,
    y = data_bad$Credit.Amount,
    legendgroup = 'Bad Credit',
    scalegroup = 'Bad Credit',
    name = 'Bad Credit',
    side = 'positive',
    box = list(
      visible = T
    ),
    meanline = list(
      visible = T
    ),
    color = I("red")
  ) 

fig5 <- fig5 %>%
  layout(
    xaxis = list(
      title = ""  
    ),
    yaxis = list(
      title = "Credit Amount (German DMK)",
      zeroline = F
    )
  )

fig4 <- plot_ly(
  y = data_good$Credit.Amount, 
  x = data_good$Job, 
  name="Good credit",
  color = '#3D9970', 
  type = "box"
)

fig4 <- fig4 %>%
  add_trace(
    y = data_bad$Credit.Amount, 
    x = data_bad$Job, 
    name="Bad credit", 
    color = "Blue", 
    type = "box"
  )


fig4 <- fig4 %>%
  layout(
    yaxis=list(
      title='Credit Amount (German DMK)',
      zeroline=F
    ),
    boxmode='group'
  )

It can be postulated from the two graphs above that Self-employed, or highly qualified professionals have high good and bad outcomes. The people with high credit amounts are also people in highly qualified or self-employed professionals. It is my opinion that since this part of this attribute includes self-employed people, they might take loans for their businesses, and these businesses might not have been able to pay back their loans. The previous statement might also explain why they have so much credit amount.

Summary of Wealth and Job Analysis

People in the middle class seem to have a low ratio of good to bad outcome with respect to risk and the people with highest savings have the highest ratio of good to bad credit.
It can also be seen that people with low savings make up most of the population that are asking for loans. This can also be due to the fact that the dataset might be a convenience dataset.
With respect to job analysis, people who are self employed take the highest amount of credit in DMK, and they have equal distribution of people with a good and bad outcome.

5.4 Miscellaneous Plots

The following two graphs are plotted majorly look at the distribution of various types of home owners in good and bad outcomes.

fig2 <- plot_ly(data = data, 
                x = names(table(data[data$Outcome == "Good", "Housing"])), 
                y = table(data[data$Outcome == "Good", "Housing"]), 
                type = 'bar', 
                name = 'Good Credit')

fig2 <- fig2 %>% 
  add_trace(y = table(data[data$Outcome == "Bad", "Housing"]), 
            name = 'Bad Credit')

fig2 <- fig2 %>% 
  layout(xaxis = list(title = 'Housing Type'),
         yaxis = list(title = 'Count'),
         barmode = 'group')

We can see that people who live for free have lowest ratio of good to bad outcome for a loan payment, and that home owners have highest ratio of the same.

Finally, to understand why people wish to take up the loan I plotted various box plots. The graph is as shown below.

fig3 <- ggplot(data, aes(x=Purpose, y=Credit.Amount, fill = Purpose)) + 
  geom_boxplot() + 
  labs(title="Distribution of Credit VS Purpose",x="Purpose", 
       y = "Credit ammount (DK)") + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 10))

Though most of the population who take up loans for other purposes take up the highest amount of money for their loans. It can also be seen that the following type of people to take up loans are the ones who wish to pay their car loans. The people who take up loans for domestic appliances are the ones who take the lowest amount for their loan.

This concludes our Exploratory data analysis section. We will now move on to predictive modelling using various machine learning techniques

6 Predictive Modelling

Within this section I employed various machine learning algorithms to classify whether if a person will default their loan or not. For predictive modelling I had employed the Python Programming Language to implement ML algorithms because of its better support for the same.

Some of the algorithms I used are

Logistic Regression
Random Forest
Decision Trees
K Nearest Neighbors
Linear Discriminant Analysis
Quadratic Discriminant Analysis
XGBoost

We will be looking at the precision, recall and f1 score for these algorithms. The code for this can be found in the interactive python notebook at the project repository

6.1 Results

NAME.OF.ML.ALGORITHM.USED	PRECISION.0	PRECISION.1	RECALL.0	RECALL.1	F1.SCORE.0	F1.SCORE.1
DECISION TREES	0.77	0.39	0.71	0.46	0.74	0.42
LOGISTIC REGRESSION	0.79	0.65	0.91	0.42	0.85	0.51
RANDOM FOREST	0.78	0.66	0.92 0	0.38	0.85	0.48
XGBOOST	0.82	0.69	0.9	0.51	0.86	0.59
QUADRATIC DISCRIMINANT ANALYSIS	0.83	0.57	0.82	0.58	0.82	0.58
SUPPORT VECTOR CLASSIFIERS	0.77	0.66	0.94	0.29	0.84	0.40

Though the results are not as great, I hope to implement a fine-tuned Deep-learning model that provides us with better results.

7 Conclusions

The German Credit data was read from the UCU machine learning repository. Initial data pre-processing was implemented to bring about clean and understandable data. Then the data was used to perform exploratory data analysis and derive inferences.
Finally, the machine-ready data were scaled and used for predicting if a person would default on their loans or not. Various machine learning algorithms were used for this purpose, and the XGBoost model performed the best compared to other algorithms.

8 Future Work

I wish to in the future make a deep learning algorithm for this data. We can also implement the trained algorithms for a Shiny Web application. However, most of all, I wish to implement the machine learning algorithms using R Programming. Since the data is also not representative one could implement data augmentation to make the data more descriptive.

9 References

UCI Machine learning dataset - [Link](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
Caret - LINK
Kaggle.com LINK

CREDIT RISK ANALYSIS AND PREDICTIVE MODELLING

Surya Sashank Gundepudi

10/16/2021