Read in Bank Loan Default Dataset
Read in dataset.
bank <- read.csv("BankLoanDefaultDataset.csv")
The output below lists each variable and its type.
str(bank)
'data.frame': 1000 obs. of 16 variables:
$ Default : int 0 0 0 1 1 0 0 0 0 1 ...
$ Checking_amount : int 988 458 158 300 63 1071 -192 172 585 189 ...
$ Term : int 15 15 14 25 24 20 13 16 20 19 ...
$ Credit_score : int 796 813 756 737 662 828 856 763 778 649 ...
$ Gender : chr "Female" "Female" "Female" "Female" ...
$ Marital_status : chr "Single" "Single" "Single" "Single" ...
$ Car_loan : int 1 1 0 0 0 1 1 1 1 1 ...
$ Personal_loan : int 0 0 1 0 0 0 0 0 0 0 ...
$ Home_loan : int 0 0 0 0 0 0 0 0 0 0 ...
$ Education_loan : int 0 0 0 1 1 0 0 0 0 0 ...
$ Emp_status : chr "employed" "employed" "employed" "employed" ...
$ Amount : int 1536 947 1678 1804 1184 475 626 1224 1162 786 ...
$ Saving_amount : int 3455 3600 3093 2449 2867 3282 3398 3022 3475 2711 ...
$ Emp_duration : int 12 25 43 0 4 12 11 12 12 0 ...
$ Age : int 38 36 34 29 30 32 38 36 36 29 ...
$ No_of_credit_acc: int 1 1 1 1 1 2 1 1 1 1 ...
The output below shows the first few rows of the dataset.
head(bank)
Default Checking_amount Term Credit_score Gender Marital_status Car_loan
1 0 988 15 796 Female Single 1
2 0 458 15 813 Female Single 1
3 0 158 14 756 Female Single 0
4 1 300 25 737 Female Single 0
5 1 63 24 662 Female Single 0
6 0 1071 20 828 Male Married 1
Personal_loan Home_loan Education_loan Emp_status Amount Saving_amount
1 0 0 0 employed 1536 3455
2 0 0 0 employed 947 3600
3 1 0 0 employed 1678 3093
4 0 0 1 employed 1804 2449
5 0 0 1 unemployed 1184 2867
6 0 0 0 employed 475 3282
Emp_duration Age No_of_credit_acc
1 12 38 1
2 25 36 1
3 43 34 1
4 0 29 1
5 4 30 1
6 12 32 2
Create Missing Values
The original dataset does not have any missing values so I needed to
manually create them.
gender.missing.id <- sample(1:1000, 20 , replace = FALSE)
Marital.missing.id <- sample(1:1000, 20, replace = FALSE)
emp.status.missing.id <- sample(1:1000, 20, replace = FALSE)
credit.missing.id <- sample(1:1000, 20, replace = FALSE)
amount.missing.id <- sample(1:1000, 20, replace = FALSE)
emp.duration.missing.id <- sample(1:1000, 20, replace = FALSE)
check.amt.missing.id <- sample(1:1000, 20, replace = FALSE)
age.missing.id <- sample(1:1000, 20, replace = FALSE)
bank$Gender[gender.missing.id] <- NA
bank$Marital_status[Marital.missing.id] <- NA
bank$Emp_status[emp.status.missing.id] <- NA
bank$Credit_score[credit.missing.id] <- NA
bank$Amount[amount.missing.id] <- NA
bank$Emp_duration[emp.duration.missing.id] <- NA
bank$Checking_amount[check.amt.missing.id] <- NA
bank$Age[age.missing.id] <- NA
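Because sample() draws different rows on every run, the missing rows above change each time the report is knitted. A minimal sketch of a repeatable version follows, assuming a fixed seed and a small toy data frame rather than the real dataset; the seed value, the toy columns, and the number of rows blanked out are all illustrative assumptions.

```r
# Toy stand-in for the bank data (hypothetical values, not the real dataset)
set.seed(123)  # assumed seed; any fixed value makes the sampled rows repeatable
toy <- data.frame(
  Gender = rep(c("Female", "Male"), 50),
  Age    = rep(20:69, 2),
  stringsAsFactors = FALSE
)

# Blank out 5 randomly chosen rows in each listed column
cols_with_na <- c("Gender", "Age")
for (col in cols_with_na) {
  missing_id <- sample(nrow(toy), 5, replace = FALSE)
  toy[[col]][missing_id] <- NA
}

# Count the missing values per column to confirm the injection worked
colSums(is.na(toy))
```

Looping over column names avoids repeating one sample() line and one assignment line per variable, and colSums(is.na(...)) gives a quick check that each column received the intended number of missing values.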
Description and Purpose of Dataset and Analytic Tasks
This dataset is titled “Bank Loan Default Dataset” and contains several
explanatory variables and one target/response variable (Default). The
purpose of collecting this dataset is to see which factors may lead to
people's loans being in default. This dataset was collected from
your Course Project Data Repository. It contains 1000
observations and 16 variables (15 feature variables and 1 target
variable). Checking_amount is a numerical variable giving the
amount of money in one's checking account. Saving_amount is a
numerical variable giving the amount of money in one's savings
account. Term (numerical) is the duration of the loan term.
Credit_score (numerical) is one's credit score. Gender
(categorical) is either male or female. Marital_status
(categorical) is either married or single. Car_loan,
Personal_loan, Home_loan, and Education_loan (all categorical/binary)
show whether a person has a loan of that type. Emp_status
(categorical) is either unemployed or employed. Amount (numerical)
is the amount of the loan. Emp_duration (numerical) is the length of
employment in months. Age (numerical) is the person's age, and
No_of_credit_acc (numerical) is the number of credit accounts. The
overall goal of this project is to see how significant the feature
variables are in predicting whether one's loan will be in default
(1 if default, 0 if not). The first part of the project is
exploratory data analysis (EDA).
The original dataset did not have any missing values, so I needed to
manually create them. In this dataset, the variables Gender,
Marital_status, Emp_status, Credit_score, Amount, Emp_duration,
Checking_amount, and Age now have 20 missing values each. Missing
numerical values can be resolved by imputing the mean. Missing
categorical values can be resolved by imputing the mode.
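As a rough illustration of the mean and mode imputation described above, the sketch below works on a small toy data frame with made-up values. The helper name impute_mode is hypothetical and is not part of base R or of the original analysis.

```r
# Toy data with missing values (hypothetical, not the real dataset)
toy <- data.frame(
  Age    = c(30, NA, 40, 50, NA),
  Gender = c("Male", "Female", NA, "Male", "Male"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill NA with the most frequent non-missing value
impute_mode <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Numeric column: replace NA with the mean of the observed values
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)

# Categorical column: replace NA with the mode
toy$Gender <- impute_mode(toy$Gender)

toy
```

Note that which.max() returns the first maximum, so if two categories are tied this helper picks whichever comes first in the table. Mean imputation also shrinks the variance of the imputed column, which is worth keeping in mind when reading later plots.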
Distribution of Individual Features
This section shows the distribution of each individual feature.
Some features have missing values, which can be resolved by imputing
the mean or mode.
The below figure shows the distribution of the Gender variable. There
are significantly more males than females.
library(ggplot2)  # needed for the ggplot() calls in this report

ggplot(bank, aes(x = Gender)) +
  geom_bar() +
  labs(title = "Gender")
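A bar chart shows the imbalance visually. A frequency table can back it up with numbers. The sketch below uses a short made-up vector in place of bank$Gender, so the counts are illustrative only.

```r
# Hypothetical values standing in for bank$Gender
toy_gender <- c("Male", "Male", "Female", NA, "Male", "Female")

# Counts per category, with missing values shown as their own group
counts <- table(toy_gender, useNA = "ifany")
counts

# The same information as proportions
round(prop.table(counts), 2)
```

Setting useNA = "ifany" keeps the missing values visible in the table, which matches the NA bar that geom_bar() draws for this variable.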

The below figure shows the distribution of the Marital_status
variable. Married and single people have similar counts.
ggplot(bank, aes(x = Marital_status)) +
geom_bar() +
labs(title = "Marital_status")

The below figure shows the distribution of the Emp_status variable.
There are significantly more unemployed than employed.
ggplot(bank, aes(x = Emp_status)) +
geom_bar() +
labs(title = "Emp_status")

The below figure shows the distribution of the car loan variable.
There are significantly more that do not have a car loan than those that
do.
ggplot(bank, aes(x = Car_loan)) +
geom_bar() +
labs(title = "Car_loan")

The below figure shows the distribution of the personal loan
variable. Those that have a personal loan and those that do not have
similar counts.
ggplot(bank, aes(x = Personal_loan)) +
geom_bar() +
labs(title = "Personal_loan")

The below figure shows the distribution of the education loan
variable. There are significantly more that do not have an education
loan than those that do.
ggplot(bank, aes(x = Education_loan)) +
geom_bar() +
labs(title = "Education_loan")

The below figure shows the distribution of the home loan variable.
There are significantly more that do not have a home loan than those
that do.
ggplot(bank, aes(x = Home_loan)) +
geom_bar() +
labs(title = "Home_loan")

The below figure shows the distribution of the credit score variable.
There are no alarming trends apart from a couple of outliers.
ggplot(data = bank, aes(x = Credit_score)) +
geom_boxplot() +
labs(title = "Credit_score")
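The points drawn beyond the whiskers can also be listed directly. The sketch below uses base R's boxplot.stats() on made-up scores. It applies the usual 1.5 × IQR rule of thumb, although it is based on hinges, so the flagged points may differ slightly from what ggplot2 draws.

```r
# Hypothetical credit scores, including one low value and one NA
scores <- c(700, 720, 750, 760, 780, 800, 810, 400, NA)

# $out holds the values lying beyond the whiskers; NA values are ignored
out <- boxplot.stats(scores)$out
out
```

Listing the flagged values makes it easier to judge whether they are data errors or just unusual but plausible customers.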

The below figure shows the distribution of the checking amount
variable. There are no alarming trends.
ggplot(data = bank, aes(x = Checking_amount)) +
geom_boxplot() +
labs(title = "Checking_amount")

The below figure shows the distribution of the term variable. There
are no alarming trends.
ggplot(data = bank, aes(x = Term)) +
geom_boxplot() +
labs(title = "Term")

The below figure shows the distribution of the amount variable. There
are no alarming trends.
ggplot(data = bank, aes(x = Amount)) +
geom_boxplot() +
labs(title = "Amount")

The below figure shows the distribution of the Saving amount
variable. There are no alarming trends.
ggplot(data = bank, aes(x = Saving_amount)) +
geom_boxplot() +
labs(title = "Saving amount")

The below figure shows the distribution of the Emp_duration variable.
There are no alarming trends.
ggplot(data = bank, aes(x = Emp_duration)) +
geom_boxplot() +
labs(title = "Emp duration")

The below figure shows the distribution of the age variable. There
are no alarming trends.
ggplot(data = bank, aes(x = Age)) +
geom_boxplot() +
labs(title = "Age")

The below figure shows the distribution of the No_of_credit_acc
variable. This variable seems to be heavily skewed.
ggplot(data = bank, aes(x = No_of_credit_acc)) +
geom_boxplot() +
labs(title = "No_of_credit_acc")
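One rough way to check the skew numerically is to compare the mean with the median and to compute a moment-based skewness coefficient by hand. The values below are made up for illustration and are not taken from the dataset.

```r
# Hypothetical counts of credit accounts
acc <- c(1, 1, 1, 1, 1, 1, 2, 2, 3, 9)

# A mean well above the median points to a long right tail
mean(acc)
median(acc)

# Moment-based skewness: third central moment over variance^(3/2)
centered <- acc - mean(acc)
skew <- mean(centered^3) / mean(centered^2)^1.5
skew

# For a count variable with few distinct values, a table is often clearer
table(acc)
```

A positive skewness value indicates a right-skewed distribution. For a variable like this with only a handful of distinct values, a bar chart may also be easier to read than a boxplot.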

One Categorical Feature and One Numerical Feature Graphs
In this section, we will show the relationship between one
categorical feature and one numerical feature.
The below figure shows the relationship between credit score and
gender. Based on the graph, the credit score ranges look to be similar
across both genders. There are some anomalies but I do not believe that
they will have a significant effect on the analysis. There are missing
values but they can be resolved using imputation.
ggplot(bank, aes(x=Credit_score, y=Gender, fill=Gender)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Credit Score by Gender")

The below figure shows the relationship between loan amount and
gender. Based on the graph, the loan amount ranges look to be similar
across both genders. There are some anomalies but I do not believe that
they will have a significant effect on the analysis. There are missing
values but they can be resolved using imputation.
ggplot(bank, aes(x=Amount, y=Gender, fill=Gender)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Amount by Gender")

The below figure shows the relationship between Employment duration
and gender. Based on the graph, it seems like males have a longer
employment duration than females. There are missing values but they can
be resolved using imputation.
ggplot(bank, aes(x=Emp_duration, y=Gender, fill=Gender)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Employment duration by Gender")
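A comparison read off a boxplot can be checked against group medians. The sketch below uses aggregate() on a toy data frame with made-up values. With the formula interface, rows with a missing value in either variable are dropped by default.

```r
# Toy data (hypothetical values, not the real dataset)
toy <- data.frame(
  Gender       = c("Male", "Male", "Male", "Female", "Female", "Female", NA),
  Emp_duration = c(40, 60, NA, 10, 20, 30, 50)
)

# Median employment duration per gender; incomplete rows are omitted
med <- aggregate(Emp_duration ~ Gender, data = toy, FUN = median)
med
```

The same call works for the other categorical and numerical pairs in this section by swapping the variable names in the formula.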

The below figure shows the relationship between Age and gender. Based
on the graph, the Age ranges look to be similar across both genders.
There are some anomalies but I do not believe that they will have a
significant effect on the analysis. There are missing values but they
can be resolved using imputation.
ggplot(bank, aes(x=Age, y=Gender, fill=Gender)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Age by Gender")

The below figure shows the relationship between marital status and
credit score. Based on the graph, the credit score ranges look to be
similar across both single and married people. There are some anomalies
but I do not believe that they will have a significant effect on the
analysis. There are missing values but they can be resolved using
imputation.
ggplot(bank, aes(x=Credit_score, y=Marital_status, fill=Marital_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Credit Score by Marital status")

The below figure shows the relationship between loan amount and
marital status. Based on the graph, the loan amount ranges look to be
similar across both marital statuses. There are some anomalies but I do
not believe that they will have a significant effect on the analysis.
There are missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Amount, y=Marital_status, fill=Marital_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Amount by Marital status")

The below figure shows the relationship between marital status and
employment duration. Based on the graph, it seems like married people
have longer employment duration than single people. There are missing
values but they can be resolved using imputation.
ggplot(bank, aes(x=Emp_duration, y=Marital_status, fill=Marital_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Employment duration by Marital status")

The below figure shows the relationship between Age and marital
status. Based on the graph, the Age ranges look to be similar across
both marital statuses. There are some anomalies but I do not believe
that they will have a significant effect on the analysis. There are
missing values but they can be resolved using imputation.
ggplot(bank, aes(x=Age, y=Marital_status, fill=Marital_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by Marital status")

The below figure shows the relationship between credit score and
employment status. Based on the graph, the credit score ranges look
similar for employed and unemployed people. There seem to be
more anomalies for unemployed people, but I do not believe that they will
have a significant effect on the analysis. There are missing values, but
they can be resolved using imputation.
ggplot(bank, aes(x=Credit_score, y=Emp_status, fill=Emp_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Credit Score by Employment status")

The below figure shows the relationship between loan amount and
employment status. Based on the graph, the loan amount ranges look
similar for employed and unemployed people. Again, there seem to be more
anomalies for unemployed people than for employed people. There are
missing values, but they can be resolved using imputation.
ggplot(bank, aes(x=Amount, y=Emp_status, fill=Emp_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Amount by Employment status")

The below figure shows the relationship between employment duration
and employment status. Based on the graph, it seems like unemployed
people have longer and more varied employment durations than employed
people. There are missing values, but they can be resolved using
imputation.
ggplot(bank, aes(x=Emp_duration, y=Emp_status, fill=Emp_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("Employment duration by employment status")

The below figure shows the relationship between Age and employment
status. Based on the graph, the Age ranges look to be similar across
both statuses. There are some anomalies but I do not believe that they
will have a significant effect on the analysis. There are missing values
but they can be resolved using imputation.
ggplot(bank, aes(x=Age, y=Emp_status, fill=Emp_status)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by employment status")

The below figure shows the relationship between Age and car loan.
Based on the graph, the Age ranges look similar for those who have a
car loan and those who do not.
# convert car, home, personal, and education loans into categorical to use for EDA purposes
bank$Car_loan <- as.factor(bank$Car_loan)
bank$Personal_loan <- as.factor(bank$Personal_loan)
bank$Home_loan <- as.factor(bank$Home_loan)
bank$Education_loan <- as.factor(bank$Education_loan)
ggplot(bank, aes(x=Age, y=Car_loan, fill=Car_loan)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by car loan")

The below figure shows the relationship between Age and personal
loan. Based on the graph, the Age ranges look similar for those
with and without a personal loan.
ggplot(bank, aes(x=Age, y=Personal_loan, fill=Personal_loan)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by personal_loan")

The below figure shows the relationship between Age and home loan.
Based on the graph, the Age ranges look similar for those with and
without a home loan.
ggplot(bank, aes(x=Age, y=Home_loan, fill=Home_loan)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by home_loan")

The below figure shows the relationship between Age and education
loan. Based on the graph, it seems like people who have an education
loan tend to be younger than those who do not.
ggplot(bank, aes(x=Age, y=Education_loan, fill=Education_loan)) +
geom_boxplot() + theme(legend.position="none")+
ggtitle("age by education loan")

Two Numerical Feature Variables Graphs
Next, we will examine the relationship between two numerical
features.
The below graph shows employment duration and credit score. Based on
the graph, it seems like regardless of the employment duration, the
majority of the credit scores seem to fall in the 700-900 range.
ggplot(data = bank, aes(x = Credit_score, y = Emp_duration)) +
geom_point() +
ggtitle("Employment Duration vs Credit Score")

The below graph shows employment duration and loan amount. There do
not seem to be any patterns in this graph.
ggplot(data = bank, aes(x = Amount, y = Emp_duration)) +
geom_point() +
ggtitle("Employment Duration vs Loan Amount")

The below graph shows employment duration and age. There do not
seem to be any patterns in this graph.
ggplot(data = bank, aes(x = Age, y = Emp_duration)) +
geom_point() +
ggtitle("Employment Duration vs Age")

The below graph shows age and Credit score. Again, it seems like
regardless of age, most of the credit scores seem to fall in that
700-900 range.
ggplot(data = bank, aes(x = Age, y = Credit_score)) +
geom_point() +
ggtitle("Age vs credit score")

The below graph shows checking amount and saving amount. There does
not seem to be any correlation between these two variables.
ggplot(data = bank, aes(x = Checking_amount, y = Saving_amount)) +
geom_point() +
ggtitle("Checking amount vs savings Amount")
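A visual impression of "no correlation" can be backed by a correlation coefficient. The sketch below uses made-up values in place of the two columns. Because both variables can contain NA after the missing values were created, it restricts the calculation to complete pairs.

```r
# Hypothetical checking and savings amounts, each with one missing value
checking <- c(100, 200, NA, 400, 500, 300)
saving   <- c(3000, 2900, 3100, NA, 3050, 2950)

# Pearson correlation using only rows where both values are present
r <- cor(checking, saving, use = "complete.obs")
r
```

Without use = "complete.obs", cor() returns NA as soon as either vector contains a missing value. A coefficient near zero would support the reading of the scatter plot, although it only measures linear association.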

The below graph shows age and loan amount. It seems like regardless
of age, most of the loan amounts seem to fall in a certain range.
ggplot(data = bank, aes(x = Age, y = Amount)) +
geom_point() +
ggtitle("age vs Loan Amount")

The below graph shows credit score and loan amount. It seems like
most of the points are clumped together.
ggplot(data = bank, aes(x = Credit_score, y = Amount)) +
geom_point() +
ggtitle("credit score vs Loan Amount")

Two Categorical Feature Variables Graphs
This section shows the relationships between two categorical
features.
The below graph shows Gender and Employment status. It seems like the
numbers of employed and unemployed females are similar, but there are
significantly more unemployed males than employed males.
# geom_bar() counts rows by default, so no explicit y aesthetic is needed
ggplot(bank, aes(x = Gender)) +
  geom_bar(aes(fill = Emp_status), position = "dodge") +
  ggtitle("gender vs employment status")

The below graph shows Gender and marital status. It seems like all
females are single and there are significantly more married males than
single.
# geom_bar() counts rows by default, so no explicit y aesthetic is needed
ggplot(bank, aes(x = Gender)) +
  geom_bar(aes(fill = Marital_status), position = "dodge") +
  ggtitle("gender vs marital status")
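A two-way table gives the exact counts behind a dodged bar chart and makes a claim such as "all females are single" easy to verify. The sketch below uses a toy data frame with made-up values.

```r
# Toy data (hypothetical values, not the real dataset)
toy <- data.frame(
  Gender         = c("Female", "Female", "Male", "Male", "Male", "Male"),
  Marital_status = c("Single", "Single", "Married", "Married", "Married", "Single")
)

# Counts for each combination of the two categorical variables
tab <- table(toy$Gender, toy$Marital_status)
tab

# Row proportions: share of each marital status within each gender
prop.table(tab, margin = 1)
```

A zero cell in the table confirms that a combination never occurs, which is harder to see in a bar chart where an absent bar can be mistaken for a very short one.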

The below graph shows employment status and marital status. It seems
like, for single people, the numbers of employed and unemployed are
relatively similar, but for married people there are significantly more
who are unemployed.
# geom_bar() counts rows by default, so no explicit y aesthetic is needed
ggplot(bank, aes(x = Marital_status)) +
  geom_bar(aes(fill = Emp_status), position = "dodge") +
  ggtitle("marital status vs employment status")

The below graph shows gender and car loan. It seems like for both
males and females, there are significantly more that do not have car
loans than those that do.
# geom_bar() counts rows by default, so no explicit y aesthetic is needed
ggplot(bank, aes(x = Gender)) +
  geom_bar(aes(fill = Car_loan), position = "dodge") +
  ggtitle("gender vs car loan")

The below graph shows gender and education loan. It seems like for
both males and females, there are significantly more that do not have
education loans than those that do.
# geom_bar() counts rows by default, so no explicit y aesthetic is needed
ggplot(bank, aes(x = Gender)) +
  geom_bar(aes(fill = Education_loan), position = "dodge") +
  ggtitle("Gender vs education Loan")

The below graph shows gender and personal loan. It seems like for
both males and females, the numbers of those who have personal loans
and those who do not are relatively similar.
# geom_bar() counts rows by default, so no explicit y aesthetic is needed
ggplot(bank, aes(x = Gender)) +
  geom_bar(aes(fill = Personal_loan), position = "dodge") +
  ggtitle("Gender vs personal Loan")

The below graph shows gender and home loan. It seems like for both
males and females, there are significantly more that do not have home
loans than those that do.
# geom_bar() counts rows by default, so no explicit y aesthetic is needed
ggplot(bank, aes(x = Gender)) +
  geom_bar(aes(fill = Home_loan), position = "dodge") +
  ggtitle("Gender vs home Loan")

