Many machine learning models are now deployed to predict whether a person will default on a loan or pay it back. These predictions draw on factors such as education level, sex, personal status, checking account balance, savings account or bonds, and many more. This project uses one such publicly available dataset, the Statlog German Credit Risk dataset, which contains anonymized details of bank customers and records whether each one defaulted on their loan or paid it back. In this project, I first pre-process the data into both human-readable and machine-readable formats, then explore the data and derive inferences, and finally use it to predict whether a person will default on their loan.
Keywords - German-Credit-Risk, Machine-Learning, Predictive Modelling
Most banks’ primary source of income is lending to their customers. They hold people’s deposits and pay some interest on them, and they lend money to other customers at a higher interest rate. The margin between the savings interest and the loan interest is where banks make most of their money.
However, every time a bank provides a loan, it faces the risk that the loan will not be paid back. Banks generally take some collateral, such as a person’s property. Even so, most banks would rather avoid lending to a person who will default, since they lose both money and the time value of that money. In that time, they could have lent to a person who would repay their loan.
Therefore, it is crucial to determine whether a person is likely to default before the bank provides the loan. In this project, I pre-processed the data and then plotted graphs using the R programming language and the plotly package. From these plots I derived inferences, and finally I used the data to build a machine learning model that predicts whether a person will default.
I also wish to build a Shiny web application that takes all of the required data and predicts whether a person can be provided with a loan.
PROJECT: https://github.com/suryasashankgundepudi/german-credit-risk-modelling
SHINY WEB APP: YET TO BE DEVELOPED
This dataset was provided by Dr. Hans Hofmann from the University of Hamburg (Universität Hamburg). It is publicly available for data scientists to use at the UCI Machine Learning Repository. The direct link to the dataset, in both its numeric and original forms, is STATLOG-GERMAN-CREDIT.
The data contains anonymized records of 1000 customers who either defaulted on their bank loan or paid back their credit duly. It has 20 attributes, 7 numerical and 13 categorical, which hold relevant information about the customer. They are listed below:
The target variable is the outcome of the risk taken by the bank. It is 1 if the risk taken was good and the person did not default, and 0 if the person defaulted.
In this section I cover some of the basic data pre-processing techniques I used to arrive at more understandable and descriptive data.
The data was first read from the UCI Machine Learning Repository using the following chunk of code. The required package for this chunk is RCurl.
I saved this data into a new directory for further processing.
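The chunk itself is not reproduced here, so the following is only a minimal sketch of how such a read could look. It assumes the raw file is space-separated with no header row, which is consistent with the raw table shown in this section. The function names `parse_credit_text` and `download_credit_data` are hypothetical, and the URL is left for the reader to supply.

```r
# Hypothetical helper: turn the raw text of the data file into a data frame.
# Assumes space-separated fields and no header row.
parse_credit_text <- function(raw_text) {
  read.table(text = raw_text, header = FALSE, sep = " ",
             stringsAsFactors = FALSE)
}

# Hypothetical helper: download the file with RCurl and parse it.
# `url` is wherever the dataset is hosted; it is not hard-coded here.
download_credit_data <- function(url) {
  raw_text <- RCurl::getURL(url)
  parse_credit_text(raw_text)
}

# Offline check on two raw rows copied from the table in this section
sample_text <- paste(
  "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1",
  "A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2",
  sep = "\n"
)
sample_df <- parse_credit_text(sample_text)
dim(sample_df)
```

Because no header is present, `read.table` names the columns `V1` to `V21`, which matches the column names in the raw table.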
The table below shows how the data looks without any kind of pre-processing
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | 4 | A121 | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 |
| A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | 2 | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
| A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | 3 | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
| A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | 4 | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
| A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | 4 | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |
As you can see, the raw data has no column names, and the categorical values appear as opaque codes (such as A11 or A34) rather than descriptive labels.
Since the data in this form cannot be used for exploratory data analysis, I used switch-case style code to give each coded value a descriptive name. For reference, I used the German.doc file provided at the UCI Machine Learning Repository, which describes each attribute in detail and what each value means. For a better understanding of the rudimentary yet robust code I employed, please visit the project page at German-credit-Risk-Repo.
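As an illustration of the switch-case idea, here is a minimal sketch for the checking account column only. It is not the code from the repository: the function name `decode_checking` is hypothetical, and the label for `A13` is an assumption, since that value does not appear in the example tables of this section.

```r
# Hypothetical decoder for the checking account codes (column V1).
# Labels for A11, A12 and A14 follow the cleaned table in this section;
# the label for A13 is an assumed placeholder.
decode_checking <- function(code) {
  switch(code,
    A11 = "< 0",
    A12 = "0 <= Checking < 200",
    A13 = ">= 200",               # assumed label
    A14 = "No Checking account",
    NA_character_                 # fallback for unknown codes
  )
}

raw_codes <- c("A11", "A12", "A14", "A11", "A11")
decoded <- vapply(raw_codes, decode_checking, character(1), USE.NAMES = FALSE)
decoded
```

Using `vapply` keeps the result a plain character vector, so it can be assigned straight back into a data frame column.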
After cleaning the data for exploratory data analysis, I arrived at more descriptive data:
| Checking.Account | Duration | Credit.History | Purpose | Credit.Amount | Savings.Account.Bonds | Present.employee |
|---|---|---|---|---|---|---|
| < 0 | 6 | critical account/other credits existing (not at this bank) | Radio or Television | 1169 | Unkown/No Savings Account | Exp >= 7 |
| 0 <= Checking < 200 | 48 | existing credits paid back duly till now | Radio or Television | 5951 | Less than 100 | 1 <= Exp < 4 |
| No Checking account | 12 | critical account/other credits existing (not at this bank) | Education | 2096 | Less than 100 | 4 <= Exp <7 |
| < 0 | 42 | existing credits paid back duly till now | Furniture/Equipment | 7882 | Less than 100 | 4 <= Exp <7 |
| < 0 | 24 | delay in paying off in the past | New Car | 4870 | Less than 100 | 1 <= Exp < 4 |
However, I also had to make sure the data was in a machine-readable format for the later part of this project. This pre-processing was done with the superml package in R. The following R code chunk converts the character-based data frame into numeric data for predictive modeling.
# Loading the package that provides LabelEncoder
library(superml)
# Reading the clean data.
data <- read.csv("data/eda-german-credit.csv")
# Defining a new variable which takes col names of qualitative columns
catColumns <- c("Checking.Account", "Credit.History", "Purpose",
"Savings.Account.Bonds", "Present.employee",
"Other.Debters", "Property", "Other.Installment.plans",
"Housing", "Job", "Telephone", "Foreign.Worker", "Outcome",
"Sex", "Personal.Status")
tf_data <- data.frame(data)
# Transforming each qualitative column into numerical labels
for (column in catColumns){
label <- LabelEncoder$new()
tf_data[, column] <- label$fit_transform(tf_data[, column])
}
# Saving the data into a new file for later use
write.csv(tf_data, "data/machine-ready-credit.csv", row.names = FALSE)
After conversion to a machine-readable format, the data looked like this. As expected, it has no character variables.
| Checking.Account | Duration | Credit.History | Purpose | Credit.Amount | Savings.Account.Bonds | Present.employee |
|---|---|---|---|---|---|---|
| 0 | 6 | 0 | 0 | 1169 | 0 | 0 |
| 1 | 48 | 1 | 0 | 5951 | 1 | 1 |
| 2 | 12 | 0 | 1 | 2096 | 1 | 2 |
| 0 | 42 | 1 | 2 | 7882 | 1 | 2 |
| 0 | 24 | 2 | 3 | 4870 | 1 | 1 |
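To make the transformation concrete, here is a minimal base-R sketch of label encoding. It does not use superml. The function name `encode_labels` is hypothetical, and the numbering rule (codes start at 0 and follow order of first appearance) is an assumption inferred from the two example tables, not a confirmed description of `LabelEncoder`.

```r
# Base-R sketch of label encoding: each distinct string gets an integer code.
# Assumption: codes start at 0 and follow the order of first appearance.
encode_labels <- function(x) {
  as.integer(factor(x, levels = unique(x))) - 1L
}

# First five Checking.Account values from the descriptive table
checking <- c("< 0", "0 <= Checking < 200", "No Checking account", "< 0", "< 0")
encoded <- encode_labels(checking)
encoded
```

Under that assumption the result reproduces the `Checking.Account` column of the machine-readable table. Note that such integer codes impose an arbitrary order on the categories, which some models may treat as meaningful.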
Since the data is now more descriptive, I plotted various graphs, most of them interactive, to derive inferences. This data analysis module has several significant parts.
Each of these categories aims to better understand the distribution of the data across various demographics. I have also included some miscellaneous plots that I thought would help me derive other inferences.
I first wanted to understand the distribution of the target variable (whether the loan provided was a good decision or a bad one). The bar graph shown below lets us understand it better.
From this graph, we see that there is a class imbalance in our target variable. Though fewer people defaulted on their loan than paid back their credit duly, the share of defaults is still high, and our aim is to reduce the number of bad loan decisions.
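The imbalance can also be checked numerically. The sketch below is illustrative only: it builds a toy outcome vector with the 700 good and 300 bad split documented for the German Credit data, rather than reading the project's file.

```r
# Toy outcome vector with the documented 700 good / 300 bad split
outcome <- rep(c("Good", "Bad"), times = c(700, 300))

# Absolute counts and relative shares of each class
counts <- table(outcome)
shares <- prop.table(counts)
round(100 * shares, 1)
```

On the real data frame the same two calls would be applied to the `Outcome` column.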
To get an idea of how the population of our dataset was distributed, I plotted a histogram that shows the distribution of the age group across the two genders and as a whole.
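library(plotly)
# Age vectors overall and per gender.
# Note: the Sex labels are matched with a trailing space, as used throughout this analysis.
data_age <- data[, "Age.in.Years"]
age_male <- data[data$Sex == "Male ", "Age.in.Years"]
age_female <- data[data$Sex == "Female ", "Age.in.Years"]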
trace0 <- plot_ly(
x = age_male,
type = "histogram",
histnorm='probability',
name="Male Age",
marker = list(
color = 'rgba(100, 149, 247, 0.7)'
)
)
trace0 <- trace0 %>%
layout(bargap = 0.05)
trace1 <- plot_ly(
x = age_female,
type = "histogram",
histnorm='probability',
name="Female Age",
marker = list(
color = 'rgba(255, 105, 180, 0.7)'
)
)
trace1 <- trace1 %>%
layout(bargap = 0.05)
trace2 <- plot_ly(
x = data_age,
type = "histogram",
histnorm='probability',
name="Distribution",
marker = list(
color = 'rgba(195, 177, 225, 0.7)'
)
)
trace2 <- trace2 %>%
layout(bargap = 0.05)
sub1 <- subplot(trace0, trace1)
fig1 <- subplot(sub1, trace2, shareX = TRUE, nrows = 2)
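# NOTE: `t` is assumed to be a plotly font list defined elsewhere in the script;
# it is not defined in this chunk.
fig1 <- fig1 %>% layout(
title = 'Age Distribution',
font=t,
plot_bgcolor = "#e5ecf6"
)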
The plot shows that most people who take a loan are in their 20s and 30s. This holds irrespective of gender, as the overall distribution shows.
The next plot shows the reasons why men and women take a loan. To visualize this, I plotted a horizontal grouped bar graph that shows the distribution of men and women across the various purposes.
purpose_list <- c(" New Car", "Business", "Domestic Appliances", "Education",
"Furniture/Equipment", "Others", "Radio or Television",
"Repairs", "Retraining", "Used Car")
purpose_male <- unname(prop.table(table(data[data$Sex == "Male ", "Purpose"])))
purpose_female <- unname(prop.table(table(data[data$Sex == "Female ", "Purpose"])))
# VARIATION OF GENDER WITH PURPOSE
trace0 <- plot_ly(
x = purpose_male,
y = purpose_list,
name = "Male Purpose",
type = "bar",
marker = list(color = 'rgba(100, 149, 247, 0.7)',
line = list(color = 'rgb(8,48,107)',
width = 1.5))
)
fig2 <- trace0 %>% add_trace(
x = purpose_female,
y = purpose_list,
name = "Female Purpose",
type = "bar",
marker = list(color = 'rgba(255, 105, 180, 0.7)',
line = list(color = 'rgb(8,48,107)',
width = 1.5))
)
fig2 <- fig2 %>% layout(
yaxis = list(title = 'Reason'),
xaxis = list(title = "Percentage of Gender"),
barmode = 'group'
)
The graph takes the share of men and of women for each purpose and plots them side by side. From this, it is inferred that for all categories other than furniture and domestic appliances, a higher percentage of men than women take a loan.
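The code above relies on a hard-coded `purpose_list` lining up with the order of the two tables. As a hedged alternative, the sketch below fixes the factor levels first so that both genders always produce shares in the same order, even when one gender has no loans for a purpose. The toy data and the helper name `share_by_purpose` are made up for illustration.

```r
# Toy data standing in for the Sex and Purpose columns
toy <- data.frame(
  Sex = c("Male", "Male", "Male", "Female", "Female"),
  Purpose = c("Education", "Business", "Education", "Education", "Repairs"),
  stringsAsFactors = FALSE
)

# One shared set of levels guarantees aligned, equal-length tables
purpose_levels <- sort(unique(toy$Purpose))

# Hypothetical helper: share of each purpose within one gender
share_by_purpose <- function(df, sex) {
  purposes <- factor(df$Purpose[df$Sex == sex], levels = purpose_levels)
  prop.table(table(purposes))
}

male_share <- share_by_purpose(toy, "Male")
female_share <- share_by_purpose(toy, "Female")
rbind(Male = male_share, Female = female_share)
```

The names of either table can then be passed as the category axis instead of a separately maintained list.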
In the following graph, I plot the distribution of the credit amount for men and women. The x-axis shows the amount of money in Deutsche Mark and the y-axis shows the count.
trace3 <- ggplot(data[data$Sex == "Male ", ], aes(x =Credit.Amount)) +
geom_histogram(alpha=0.5, position="identity", bins = 50,
color="blue", fill = "lightblue") +
geom_vline(aes(xintercept=mean(data[data$Sex == "Male ", "Credit.Amount"])),
color="black", linetype="dashed", size=0.2)
trace4 <- ggplot(data[data$Sex == "Female ", ], aes(x =Credit.Amount)) +
geom_histogram(alpha=0.5, position="identity", bins = 50,
color="red", fill = "pink") +
geom_vline(aes(xintercept=mean(data[data$Sex == "Female ", "Credit.Amount"])),
color="black", linetype="dashed", size=0.3)
fig3 <- subplot(trace3, trace4, nrows = 2, shareX = TRUE)
It could be hypothesized that a person’s gender does not affect their credit amount and that the majority of the population has a credit amount between 1000 DM and 2500 DM.
Finally, for our gender analysis, I looked at whether gender affects a person’s loan outcome. The next plot shows the count of men and women who have good and bad outcomes respectively.
###############################################################################
# #
# DISTRIBUTION OF GENDER VS RISK #
# #
###############################################################################
trace5 <- plot_ly(data = data,
x = names(table(data[data$Outcome == "Good", "Sex"])),
y = table(data[data$Outcome == "Good", "Sex"]),
type = 'bar',
name = 'Good Credit')
fig5 <- trace5 %>%
add_trace(y = table(data[data$Outcome == "Bad", "Sex"]),
name = 'Bad Credit')
fig4 <- fig5 %>%
layout(xaxis = list(title = 'Sex'),
yaxis = list(title = 'Count'),
barmode = 'group')
The plot above suggests that men, in general, have a higher ratio of good to bad outcomes than women. However, the data might not entirely represent the general population, as there is an imbalance between the number of men and women.
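The ratio of good to bad outcomes can be read off a two-way table rather than estimated by eye from the bars. The following is a minimal sketch on made-up rows; the real analysis would use the `Sex` and `Outcome` columns of the cleaned data.

```r
# Made-up rows standing in for the Sex and Outcome columns
toy <- data.frame(
  Sex = c("Male", "Male", "Male", "Male", "Female", "Female", "Female"),
  Outcome = c("Good", "Good", "Good", "Bad", "Good", "Bad", "Bad")
)

# Rows are genders, columns are outcomes
counts <- table(toy$Sex, toy$Outcome)

# Good-to-bad ratio per gender
good_to_bad <- counts[, "Good"] / counts[, "Bad"]
good_to_bad
```

Comparing ratios instead of raw counts removes the effect of one group simply being larger than the other.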
Summary of Gender Analysis
In the age analysis module, I looked at whether different age groups carry higher or lower risk. I also look at the credit amount distribution, but I do not examine outliers in this analysis.
The population has been split into 4 major age groups.
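A minimal sketch of how such groups can be formed with `cut()`. The breakpoints and labels are the ones used in the box plot code later in this section; the sample ages are made up for illustration.

```r
# Breakpoints and labels as used later in this section
breaks <- c(18, 25, 35, 60, 120)
labels <- c("Young", "Young Adult", "Adult", "Senior")

# Made-up ages, including values that sit exactly on a boundary
ages <- c(19, 25, 26, 35, 36, 60, 61, 75)

# By default the intervals are right-closed: (18, 25], (25, 35], ...
age_group <- cut(ages, breaks = breaks, labels = labels)
table(age_group)
```

Because the intervals are open on the left, an age of exactly 18 would fall outside the first group and become `NA`; passing `include.lowest = TRUE` would avoid that if such ages occur.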
For the first plot, I plotted an overlaid histogram of the age distribution for people with good and bad credit.
# Function takes data, feature to plot the distribution of, and separating label
plot_multi_histogram <- function(df, feature, label_column) {
plt <- ggplot(df, aes(x=eval(parse(text=feature)),
fill=eval(parse(text=label_column)))) +
geom_histogram(alpha=0.5, position="identity",
aes(y = ..density..), color="black") +
geom_density(alpha=0.1) +
geom_vline(aes(xintercept=mean(eval(parse(text=feature)))),
color="black", linetype="dashed", size=1) +
labs(x=feature,
y = "Density")
plt + guides(fill=guide_legend(title=label_column))
}
fig1 <- plot_multi_histogram(data, "Age.in.Years", "Outcome")
It can be inferred from this graph that most bad-risk customers come from the younger population. The distribution for good credit is also right-skewed, with most good-credit customers in their late 20s and early 30s.
The following graph is a box plot of credit amount across the age groups. This shows whether some age groups are more or less affluent than others.
#Let's look the Credit Amount column
interval = c(18, 25, 35, 60, 120)
cats = c('Young', 'Young Adult', 'Adult', 'Senior')
data["Age_Group"] = cut(data$Age.in.Years, interval, labels=cats)
data_good = data.frame(data[data["Outcome"] == 'Good', ])
data_bad = data.frame(data[data["Outcome"] == 'Bad', ])
fig2 <- plot_ly(
y = data_good$Credit.Amount,
x = data_good$Age_Group,
name="Good credit",
color = '#3D9970',
type = "box"
)
fig2 <- fig2 %>%
add_trace(
y = data_bad$Credit.Amount,
x = data_bad$Age_Group,
name="Bad credit",
color = "Blue",
type = "box"
)
fig2 <- fig2 %>%
layout(
yaxis=list(
title='Credit Amount (DM)',
zeroline=F
),
xaxis=list(
title='Age Categorical'
),
boxmode='group'
)
Young adults and adults have higher credit amounts than the other age groups. The graph also shows that, in general, people with lower credit amounts have bad outcomes.
Another representation of the same is a violin plot as shown below. The violin plot also provides us with similar inferences as the box plot.
fig3 <- data %>%
plot_ly(type = 'violin')
fig3 <- fig3 %>%
add_trace(
x = data_good$Age_Group,
y = data_good$Credit.Amount,
legendgroup = 'Good Credit',
scalegroup = 'Good Credit',
name = 'Good Credit',
side = 'negative',
box = list(
visible = T
),
meanline = list(
visible = T
),
color = I("blue")
)
fig3 <- fig3 %>%
add_trace(
x = data_bad$Age_Group,
y = data_bad$Credit.Amount,
legendgroup = 'Bad Credit',
scalegroup = 'Bad Credit',
name = 'Bad Credit',
side = 'positive',
box = list(
visible = T
),
meanline = list(
visible = T
),
color = I("red")
)
fig3 <- fig3 %>%
layout(
xaxis = list(title = "Age Categories"),
yaxis = list(
title = "Credit Amount",
zeroline = F
)
)
Finally, I plot a stacked bar graph of good and bad loans for the different age groups to understand the ratio of good to bad credit.
# Total credit amount per age group, separately for good and bad loans.
# The group mask must come from the same data frame as the amounts.
young_good = sum(na.omit(data_good$Credit.Amount[data_good$Age_Group == "Young"]))
young_bad = sum(na.omit(data_bad$Credit.Amount[data_bad$Age_Group == "Young"]))
young_adult_good = sum(na.omit(data_good$Credit.Amount[data_good$Age_Group == "Young Adult"]))
young_adult_bad = sum(na.omit(data_bad$Credit.Amount[data_bad$Age_Group == "Young Adult"]))
adult_good = sum(na.omit(data_good$Credit.Amount[data_good$Age_Group == "Adult"]))
adult_bad = sum(na.omit(data_bad$Credit.Amount[data_bad$Age_Group == "Adult"]))
elder_good = sum(na.omit(data_good$Credit.Amount[data_good$Age_Group == "Senior"]))
elder_bad = sum(na.omit(data_bad$Credit.Amount[data_bad$Age_Group == "Senior"]))
# Share of good and bad credit within each age group, in percent
young_good_p = young_good/(young_good + young_bad) * 100
young_bad_p = young_bad/(young_good + young_bad) * 100
young_adult_good_p = young_adult_good/(young_adult_good + young_adult_bad) * 100
young_adult_bad_p = young_adult_bad/(young_adult_good + young_adult_bad) * 100
adult_good_p = adult_good/(adult_good + adult_bad) * 100
adult_bad_p = adult_bad/(adult_good + adult_bad) * 100
elder_good_p = elder_good/(elder_good + elder_bad) * 100
elder_bad_p = elder_bad/(elder_good + elder_bad) * 100
# Rounding to three decimals for the bar labels
young_good_p = round(young_good_p, 3)
young_bad_p = round(young_bad_p, 3)
young_adult_good_p = round(young_adult_good_p, 3)
young_adult_bad_p = round(young_adult_bad_p, 3)
adult_good_p = round(adult_good_p, 3)
adult_bad_p = round(adult_bad_p, 3)
elder_good_p = round(elder_good_p, 3)
elder_bad_p = round(elder_bad_p, 3)
good_text <- c(paste(young_good_p, '%'), paste(young_adult_good_p, '%'),
paste(adult_good_p, '%'), paste(elder_good_p, '%'))
bad_text <- c(paste(young_bad_p, '%'), paste(young_adult_bad_p, '%'),
paste(adult_bad_p, '%'), paste(elder_bad_p, '%'))
good_loans <- plot_ly(
x=cats,
y=c(young_good, young_adult_good, adult_good, elder_good),
type = "bar",
name="Good Loans",
text=good_text,
textposition = 'auto',
marker=list(
color='rgb(111, 235, 146)',
line=list(
color='rgb(60, 199, 100)',
width=1.5)
),
opacity=0.6
)
overall_loan <- good_loans %>% add_trace(
x=cats,
y = c(young_bad, young_adult_bad, adult_bad, elder_bad),
text=bad_text,
type = "bar",
name="Bad Loans",
textposition = 'auto',
marker=list(
color='rgb(247, 98, 98)',
line=list(
color='rgb(225, 56, 56)',
width=1.5)
),
opacity=0.6
)
overall_loan <- overall_loan %>% layout(
title="Type of Loan by Age Group",
xaxis = list(title="Age Group"),
yaxis= list(title="Credit Amount")
)
From this graph, it is understood that young adults have the highest ratio of good to bad risk outcomes. On the other hand, seniors surprisingly have the lowest ratio of good to bad risk outcomes.
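The per-group totals and percentages above can also be computed in one pass, which avoids copy-and-paste slips between the many intermediate variables. This is only a sketch on made-up rows using base R's `tapply` and `prop.table`; it is not the code used to produce the figure.

```r
# Made-up rows standing in for Age_Group, Outcome and Credit.Amount
toy <- data.frame(
  Age_Group = c("Young", "Young", "Young", "Adult", "Adult"),
  Outcome = c("Good", "Bad", "Good", "Good", "Bad"),
  Credit.Amount = c(1000, 500, 1500, 3000, 1000)
)

# Total credit amount for every (age group, outcome) pair
totals <- tapply(toy$Credit.Amount, list(toy$Age_Group, toy$Outcome), sum)
totals[is.na(totals)] <- 0  # empty combinations count as zero

# Row-wise shares in percent: good vs bad within each age group
share <- round(100 * prop.table(totals, margin = 1), 3)
share
```

Each row of `share` sums to 100, so the good and bad percentages for a group can be read side by side.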
Summary of Age Analysis
Within this section I try to understand how people from different wealth classes are distributed in our data-set.
trace5 <- plot_ly(data = data,
x = names(table(data[data$Outcome == "Good", "Savings.Account.Bonds"])),
y = table(data[data$Outcome == "Good", "Savings.Account.Bonds"]),
type = 'bar',
name = 'Good Credit')
fig6 <- trace5 %>%
add_trace(y = table(data[data$Outcome == "Bad", "Savings.Account.Bonds"]),
name = 'Bad Credit')
fig6 <- fig6 %>%
layout(xaxis = list(title = 'Number of Bonds'),
yaxis = list(title = 'Count'),
barmode = 'group')
Surprisingly, people with higher savings have a lower ratio of good to bad outcomes, while people with lower savings have a higher ratio of good to bad credit. However, the class with the highest savings has the highest good to bad credit ratio.
A similar distribution can be seen across the different types of credit history.
trace5 <- plot_ly(data = data,
x = names(table(data[data$Outcome == "Good", "Credit.History"])),
y = table(data[data$Outcome == "Good", "Credit.History"]),
type = 'bar',
name = 'Good Credit')
fig7 <- trace5 %>%
add_trace(y = table(data[data$Outcome == "Bad", "Credit.History"]),
name = 'Bad Credit')
fig7 <- fig7 %>%
layout(barmode = 'group')
For the data provided, the job attribute is split into different levels of skill and industry. In this analysis, I made two plots of the different types of jobs against their credit amount: a violin plot and a box plot.
fig5 <- data %>%
plot_ly(type = 'violin')
fig5 <- fig5 %>%
add_trace(
x = data_good$Job,
y = data_good$Credit.Amount,
legendgroup = 'Good Credit',
scalegroup = 'Good Credit',
name = 'Good Credit',
side = 'negative',
box = list(
visible = T
),
meanline = list(
visible = T
),
color = I("blue")
)
fig5 <- fig5 %>%
add_trace(
x = data_bad$Job,
y = data_bad$Credit.Amount,
legendgroup = 'Bad Credit',
scalegroup = 'Bad Credit',
name = 'Bad Credit',
side = 'positive',
box = list(
visible = T
),
meanline = list(
visible = T
),
color = I("red")
)
fig5 <- fig5 %>%
layout(
xaxis = list(
title = ""
),
yaxis = list(
title = "Credit Amount (DM)",
zeroline = F
)
)
fig4 <- plot_ly(
y = data_good$Credit.Amount,
x = data_good$Job,
name="Good credit",
color = '#3D9970',
type = "box"
)
fig4 <- fig4 %>%
add_trace(
y = data_bad$Credit.Amount,
x = data_bad$Job,
name="Bad credit",
color = "Blue",
type = "box"
)
fig4 <- fig4 %>%
layout(
yaxis=list(
title='Credit Amount (DM)',
zeroline=F
),
boxmode='group'
)
The two graphs above suggest that self-employed or highly qualified professionals have high numbers of both good and bad outcomes. People with high credit amounts also tend to be in highly qualified or self-employed professions. In my opinion, since this category includes self-employed people, they might take loans for their businesses, and some of these businesses might not have been able to pay back their loans. This might also explain why their credit amounts are so high.
Summary of Wealth and Job Analysis
The following two graphs mainly look at how the various types of home owners are distributed across good and bad outcomes.
fig2 <- plot_ly(data = data,
x = names(table(data[data$Outcome == "Good", "Housing"])),
y = table(data[data$Outcome == "Good", "Housing"]),
type = 'bar',
name = 'Good Credit')
fig2 <- fig2 %>%
add_trace(y = table(data[data$Outcome == "Bad", "Housing"]),
name = 'Bad Credit')
fig2 <- fig2 %>%
layout(xaxis = list(title = 'Housing Type'),
yaxis = list(title = 'Count'),
barmode = 'group')
We can see that people who live for free have the lowest ratio of good to bad outcomes for loan repayment, and that home owners have the highest.
Finally, to understand why people take out loans, I plotted box plots of the credit amount for each purpose. The graph is shown below.
fig3 <- ggplot(data, aes(x=Purpose, y=Credit.Amount, fill = Purpose)) +
geom_boxplot() +
labs(title="Distribution of Credit VS Purpose",x="Purpose",
y = "Credit amount (DM)") +
theme_classic() +
theme(axis.text.x = element_text(angle = 10))
People who take out loans for other purposes borrow the highest amounts of money. The next highest amounts go to people who take out loans for cars. People who take out loans for domestic appliances borrow the lowest amounts.
This concludes our exploratory data analysis section. We will now move on to predictive modelling using various machine learning techniques.
In this section I employed various machine learning algorithms to classify whether a person will default on their loan. For predictive modelling I used the Python programming language to implement the ML algorithms because of its better support for them.
Some of the algorithms I used are listed in the results table below.
We will be looking at the precision, recall and F1 score for these algorithms. The code for this can be found in the interactive Python notebook at the project repository.
| NAME.OF.ML.ALGORITHM.USED | PRECISION.0 | PRECISION.1 | RECALL.0 | RECALL.1 | F1.SCORE.0 | F1.SCORE.1 |
|---|---|---|---|---|---|---|
| DECISION TREES | 0.77 | 0.39 | 0.71 | 0.46 | 0.74 | 0.42 |
| LOGISTIC REGRESSION | 0.79 | 0.65 | 0.91 | 0.42 | 0.85 | 0.51 |
| RANDOM FOREST | 0.78 | 0.66 | 0.92 | 0.38 | 0.85 | 0.48 |
| XGBOOST | 0.82 | 0.69 | 0.9 | 0.51 | 0.86 | 0.59 |
| QUADRATIC DISCRIMINANT ANALYSIS | 0.83 | 0.57 | 0.82 | 0.58 | 0.82 | 0.58 |
| SUPPORT VECTOR CLASSIFIERS | 0.77 | 0.66 | 0.94 | 0.29 | 0.84 | 0.40 |
Though the results are not great, I hope to implement a fine-tuned deep learning model that provides better results.
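For readers checking the table, the F1 score is the harmonic mean of precision and recall. The short sketch below applies the standard formula to the XGBoost row for class 1; the function name `f1_score` is just a label chosen for this example.

```r
# F1 is the harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# XGBoost, class 1: precision 0.69 and recall 0.51 from the table
round(f1_score(0.69, 0.51), 2)  # 0.59
```

The harmonic mean stays low whenever either input is low, which is why the class 1 scores lag well behind class 0 for every model despite reasonable precision.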
The German Credit data was read from the UCI Machine Learning Repository. Initial data pre-processing produced clean and understandable data, which was then used for exploratory data analysis and to derive inferences.
Finally, the machine-ready data was scaled and used to predict whether a person would default on their loan. Various machine learning algorithms were used for this purpose, and the XGBoost model performed the best of them.
In the future, I wish to build a deep learning model for this data. The trained models could also be deployed in a Shiny web application. Most of all, however, I wish to implement the machine learning algorithms in R. Since the data is also not representative, one could apply data augmentation to make it more representative.