1 Read in Bank Loan Default Dataset

Read in the dataset.

bank <- read.csv("BankLoanDefaultDataset.csv")

The output below shows the structure of the dataset, including the type of each variable.

str(bank)
'data.frame':   1000 obs. of  16 variables:
 $ Default         : int  0 0 0 1 1 0 0 0 0 1 ...
 $ Checking_amount : int  988 458 158 300 63 1071 -192 172 585 189 ...
 $ Term            : int  15 15 14 25 24 20 13 16 20 19 ...
 $ Credit_score    : int  796 813 756 737 662 828 856 763 778 649 ...
 $ Gender          : chr  "Female" "Female" "Female" "Female" ...
 $ Marital_status  : chr  "Single" "Single" "Single" "Single" ...
 $ Car_loan        : int  1 1 0 0 0 1 1 1 1 1 ...
 $ Personal_loan   : int  0 0 1 0 0 0 0 0 0 0 ...
 $ Home_loan       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Education_loan  : int  0 0 0 1 1 0 0 0 0 0 ...
 $ Emp_status      : chr  "employed" "employed" "employed" "employed" ...
 $ Amount          : int  1536 947 1678 1804 1184 475 626 1224 1162 786 ...
 $ Saving_amount   : int  3455 3600 3093 2449 2867 3282 3398 3022 3475 2711 ...
 $ Emp_duration    : int  12 25 43 0 4 12 11 12 12 0 ...
 $ Age             : int  38 36 34 29 30 32 38 36 36 29 ...
 $ No_of_credit_acc: int  1 1 1 1 1 2 1 1 1 1 ...

The first few rows of the dataset are shown below.

head(bank)
  Default Checking_amount Term Credit_score Gender Marital_status Car_loan
1       0             988   15          796 Female         Single        1
2       0             458   15          813 Female         Single        1
3       0             158   14          756 Female         Single        0
4       1             300   25          737 Female         Single        0
5       1              63   24          662 Female         Single        0
6       0            1071   20          828   Male        Married        1
  Personal_loan Home_loan Education_loan Emp_status Amount Saving_amount
1             0         0              0   employed   1536          3455
2             0         0              0   employed    947          3600
3             1         0              0   employed   1678          3093
4             0         0              1   employed   1804          2449
5             0         0              1 unemployed   1184          2867
6             0         0              0   employed    475          3282
  Emp_duration Age No_of_credit_acc
1           12  38                1
2           25  36                1
3           43  34                1
4            0  29                1
5            4  30                1
6           12  32                2

1.1 Create Missing Values

The original dataset does not have any missing values, so I created them manually by setting 20 randomly chosen rows to NA in each of eight variables.

gender.missing.id <- sample(1:1000, 20, replace = FALSE)
Marital.missing.id <- sample(1:1000, 20, replace = FALSE)
emp.status.missing.id <- sample(1:1000, 20, replace = FALSE)
credit.missing.id <- sample(1:1000, 20, replace = FALSE)
amount.missing.id <- sample(1:1000, 20, replace = FALSE)
emp.duration.missing.id <- sample(1:1000, 20, replace = FALSE)
check.amt.missing.id <- sample(1:1000, 20, replace = FALSE)
age.missing.id <- sample(1:1000, 20, replace = FALSE)

bank$Gender[gender.missing.id] <- NA
bank$Marital_status[Marital.missing.id] <- NA
bank$Emp_status[emp.status.missing.id] <- NA
bank$Credit_score[credit.missing.id] <- NA
bank$Amount[amount.missing.id] <- NA
bank$Emp_duration[emp.duration.missing.id] <- NA
bank$Checking_amount[check.amt.missing.id] <- NA
bank$Age[age.missing.id] <- NA

2 Description and Purpose of Dataset and Analytic Tasks

This dataset is titled “Bank Loan Default Dataset” and contains several explanatory variables and one target (response) variable, Default. The purpose of collecting this dataset is to see which factors may cause people's loans to go into default. The dataset was obtained from the Course Project Data Repository. It contains 1000 observations and 16 variables (15 feature variables and 1 target variable).

Checking_amount (numerical) is the amount of money in one's checking account, and Saving_amount (numerical) is the amount of money in one's savings account. Term (numerical) is the duration of the loan term. Credit_score (numerical) is one's credit score. Gender (categorical) is male or female. Marital_status (categorical) is married or single. Car_loan, Personal_loan, Home_loan, and Education_loan (all categorical/binary) show whether a person has a loan of that type. Emp_status (categorical) is employed or unemployed. Amount (numerical) is the amount of the loan. Emp_duration (numerical) is the length of employment in months. Age (numerical) is the person's age, and No_of_credit_acc (numerical) is the number of credit accounts.

The overall goal of this project is to see how significant the feature variables are in predicting whether one's loan will be in default (1 if default, 0 if not). The first part of the project is exploratory data analysis (EDA).

The original dataset did not have any missing values, so I created them manually. In this dataset, the variables Gender, Marital_status, Emp_status, Credit_score, Amount, Emp_duration, Checking_amount, and Age now have missing values. Missing numerical values can be resolved by imputing the mean, and missing categorical values by imputing the mode.

3 Distribution of Individual Features

This section shows the distribution of each individual feature. Some features have missing values, which can be resolved by imputing the mean or mode.

The figure below shows the distribution of the Gender variable. There are considerably more males than females.

ggplot(bank, aes(x = Gender)) + 
  
  geom_bar() +
  labs(title = "Gender")

The figure below shows the distribution of the Marital_status variable. The counts of married and single people are similar.

ggplot(bank, aes(x = Marital_status)) + 
  
  geom_bar() +
  labs(title = "Marital_status")

The figure below shows the distribution of the Emp_status variable. There are considerably more unemployed people than employed people.

ggplot(bank, aes(x = Emp_status)) + 
  
  geom_bar() +
  labs(title = "Emp_status")

The figure below shows the distribution of the Car_loan variable. Considerably more people do not have a car loan than have one.

ggplot(bank, aes(x = Car_loan)) + 
  
  geom_bar() +
  labs(title = "Car_loan")

The figure below shows the distribution of the Personal_loan variable. The counts of people with and without a personal loan are similar.

ggplot(bank, aes(x = Personal_loan)) + 
  
  geom_bar() +
  labs(title = "Personal_loan")

The figure below shows the distribution of the Education_loan variable. Considerably more people do not have an education loan than have one.

ggplot(bank, aes(x = Education_loan)) + 
  
  geom_bar() +
  labs(title = "Education_loan")

The figure below shows the distribution of the Home_loan variable. Considerably more people do not have a home loan than have one.

ggplot(bank, aes(x = Home_loan)) + 
  
  geom_bar() +
  labs(title = "Home_loan")

The figure below shows the distribution of the Credit_score variable. There are no alarming trends apart from a couple of anomalies.

ggplot(data = bank, aes(x = Credit_score)) + 
  geom_boxplot() + 
  
  labs(title = "Credit_score")

The figure below shows the distribution of the Checking_amount variable. There are no alarming trends.

ggplot(data = bank, aes(x = Checking_amount)) + 
  geom_boxplot() + 
  
  labs(title = "Checking_amount")

The figure below shows the distribution of the Term variable. There are no alarming trends.

ggplot(data = bank, aes(x = Term)) + 
  geom_boxplot() + 
  
  labs(title = "Term")

The figure below shows the distribution of the Amount variable. There are no alarming trends.

ggplot(data = bank, aes(x = Amount)) + 
  geom_boxplot() + 
  
  labs(title = "Amount")

The figure below shows the distribution of the Saving_amount variable. There are no alarming trends.

ggplot(data = bank, aes(x = Saving_amount)) + 
  geom_boxplot() + 
  
  labs(title = "Saving amount")

The figure below shows the distribution of the Emp_duration variable. There are no alarming trends.

ggplot(data = bank, aes(x = Emp_duration)) + 
  geom_boxplot() + 
  
  labs(title = "Emp duration")

The figure below shows the distribution of the Age variable. There are no alarming trends.

ggplot(data = bank, aes(x = Age)) + 
  geom_boxplot() + 
  
  labs(title = "Age")

The figure below shows the distribution of the No_of_credit_acc variable. This variable appears to be heavily skewed.

ggplot(data = bank, aes(x = No_of_credit_acc)) + 
  geom_boxplot() + 
  
  labs(title = "No_of_credit_acc")

4 One Categorical Feature and One Numerical feature Graphs

In this section, we will show the relationship between one categorical feature and one numerical feature.

The figure below shows the relationship between credit score and gender. Based on the graph, the credit score ranges look similar across both genders. There are some anomalies, but I do not believe that they will have a significant effect on the analysis. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Credit_score, y=Gender, fill=Gender)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Credit Score by Gender")

The figure below shows the relationship between loan amount and gender. Based on the graph, the loan amount ranges look similar across both genders. There are some anomalies, but I do not believe that they will have a significant effect on the analysis. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Amount, y=Gender, fill=Gender)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Amount by Gender")

The figure below shows the relationship between employment duration and gender. Based on the graph, males appear to have a longer employment duration than females. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Emp_duration, y=Gender, fill=Gender)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Employment duration by Gender")

The figure below shows the relationship between age and gender. Based on the graph, the age ranges look similar across both genders. There are some anomalies, but I do not believe that they will have a significant effect on the analysis. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Age, y=Gender, fill=Gender)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Age by Gender")

The figure below shows the relationship between marital status and credit score. Based on the graph, the credit score ranges look similar for single and married people. There are some anomalies, but I do not believe that they will have a significant effect on the analysis. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Credit_score, y=Marital_status, fill=Marital_status)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Credit Score by Marital status")

The figure below shows the relationship between loan amount and marital status. Based on the graph, the loan amount ranges look similar across both marital statuses. There are some anomalies, but I do not believe that they will have a significant effect on the analysis. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Amount, y=Marital_status, fill=Marital_status)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Amount by Marital status")

The figure below shows the relationship between marital status and employment duration. Based on the graph, married people appear to have a longer employment duration than single people. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Emp_duration, y=Marital_status, fill=Marital_status)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Employment duration by Marital status")

The figure below shows the relationship between age and marital status. Based on the graph, the age ranges look similar across both marital statuses. There are some anomalies, but I do not believe that they will have a significant effect on the analysis. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Age, y=Marital_status, fill=Marital_status)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("age by Marital status")

The figure below shows the relationship between credit score and employment status. Based on the graph, the credit score ranges look similar for employed and unemployed people. There seem to be more anomalies for unemployed people, but I do not believe that they will have a significant effect on the analysis. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Credit_score, y=Emp_status, fill=Emp_status)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Credit Score by Employment status")

The figure below shows the relationship between loan amount and employment status. Based on the graph, the loan amount ranges look similar for employed and unemployed people. Again, there seem to be more anomalies for unemployed people than for employed people. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Amount, y=Emp_status, fill=Emp_status)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Amount by Employment status")

The figure below shows the relationship between employment duration and employment status. Based on the graph, unemployed people appear to have longer and more varied employment durations than employed people. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Emp_duration, y=Emp_status, fill=Emp_status)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Employment duration by employment status")

The figure below shows the relationship between age and employment status. Based on the graph, the age ranges look similar across both statuses. There are some anomalies, but I do not believe that they will have a significant effect on the analysis. The missing values can be resolved using imputation.

ggplot(bank, aes(x=Age, y=Emp_status, fill=Emp_status)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("age by employment status")

The figure below shows the relationship between age and car loan. Based on the graph, the age ranges look similar for those who have a car loan and those who do not.

# Convert the car, personal, home, and education loan indicators to factors for EDA purposes
bank$Car_loan <- as.factor(bank$Car_loan)
bank$Personal_loan <- as.factor(bank$Personal_loan)
bank$Home_loan <- as.factor(bank$Home_loan)
bank$Education_loan <- as.factor(bank$Education_loan)

ggplot(bank, aes(x=Age, y=Car_loan, fill=Car_loan)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("age by car loan")

The figure below shows the relationship between age and personal loan. Based on the graph, the age ranges look similar for those who have a personal loan and those who do not.

ggplot(bank, aes(x=Age, y=Personal_loan, fill=Personal_loan)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("age by personal_loan")

The figure below shows the relationship between age and home loan. Based on the graph, the age ranges look similar for those who have a home loan and those who do not.

ggplot(bank, aes(x=Age, y=Home_loan, fill=Home_loan)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("age by home_loan")

The figure below shows the relationship between age and education loan. Based on the graph, people with an education loan tend to be younger than people without one.

ggplot(bank, aes(x=Age, y=Education_loan, fill=Education_loan)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("age by education loan")

5 Two numerical feature variables graphs

Next, we will examine the relationship between two numerical features.

The graph below shows employment duration and credit score. Regardless of employment duration, the majority of credit scores fall in the 700-900 range.

ggplot(data = bank, aes(x = Credit_score, y = Emp_duration)) +
  geom_point() + 
  ggtitle("Employment Duration vs Credit Score")

The graph below shows employment duration and loan amount. There do not seem to be any patterns in this graph.

ggplot(data = bank, aes(x = Amount, y = Emp_duration)) +
  geom_point() + 
  ggtitle("Employment Duration vs Loan Amount")

The graph below shows employment duration and age. There do not seem to be any patterns in this graph.

ggplot(data = bank, aes(x = Age, y = Emp_duration)) +
  geom_point() + 
  ggtitle("Employment Duration vs Age")

The graph below shows age and credit score. Again, regardless of age, most credit scores fall in the 700-900 range.

ggplot(data = bank, aes(x = Age, y = Credit_score)) +
  geom_point() + 
  ggtitle("Age vs credit score")

The graph below shows checking amount and saving amount. There does not seem to be any correlation between these two variables.

ggplot(data = bank, aes(x = Checking_amount, y = Saving_amount)) +
  geom_point() + 
  ggtitle("Checking amount vs savings Amount")
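
The visual impression of no correlation can also be checked numerically. The following is a minimal sketch on a small made-up data frame `toy` (an assumption for illustration, not the bank dataset). Because Checking_amount contains missing values, `cor()` would return NA unless incomplete rows are excluded with `use = "complete.obs"`.

```r
# Hypothetical toy data (not the bank dataset)
toy <- data.frame(
  Checking_amount = c(100, 200, NA, 400, 500),
  Saving_amount   = c(3000, 3100, 3200, 2900, 3050)
)

# Pearson correlation using only rows where both values are present
r <- cor(toy$Checking_amount, toy$Saving_amount, use = "complete.obs")
r
```

A value near 0 would support the reading of the scatter plot; values near -1 or 1 would indicate a strong linear relationship.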

The graph below shows age and loan amount. Regardless of age, most loan amounts fall within a certain range.

ggplot(data = bank, aes(x = Age, y = Amount)) +
  geom_point() + 
  ggtitle("age vs Loan Amount")

The graph below shows credit score and loan amount. Most of the points are clustered together.

ggplot(data = bank, aes(x = Credit_score, y = Amount)) +
  geom_point() + 
  ggtitle("credit score vs Loan Amount")

6 Two categorical feature variables graphs

This section shows the relationships between two categorical features.

The graph below shows gender and employment status. The numbers of employed and unemployed females are similar, but there are considerably more unemployed males than employed males.

# after_stat(count) replaces the deprecated ..count.. notation
ggplot(bank, aes(x = Gender, y = after_stat(count))) +
  geom_bar(aes(fill = Emp_status), position = "dodge") +
  ggtitle("gender vs employment status")

The graph below shows gender and marital status. It seems that all females are single, and there are considerably more married males than single males.

# after_stat(count) replaces the deprecated ..count.. notation
ggplot(bank, aes(x = Gender, y = after_stat(count))) +
  geom_bar(aes(fill = Marital_status), position = "dodge") +
  ggtitle("gender vs marital status")
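
A claim such as "all females are single" is easier to confirm from a two-way table than from a bar chart. The following is a minimal sketch on made-up vectors `gender` and `marital` (assumptions for illustration, not the bank dataset).

```r
# Hypothetical toy vectors (not the bank dataset)
gender  <- c("Female", "Female", "Male", "Male", "Male", NA)
marital <- c("Single", "Single", "Married", "Single", "Married", "Married")

# Two-way table of counts; useNA = "ifany" keeps missing values visible
tab <- table(gender, marital, useNA = "ifany")
tab

# Row proportions: share of each marital status within each gender
# (rows with a missing value are dropped here)
prop.table(table(gender, marital), margin = 1)
```

A zero in the Female/Married cell of such a table would confirm the observation directly.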

The graph below shows employment status and marital status. Among single people, the numbers of employed and unemployed are relatively similar, but among married people there are considerably more who are unemployed.

# after_stat(count) replaces the deprecated ..count.. notation
ggplot(bank, aes(x = Marital_status, y = after_stat(count))) +
  geom_bar(aes(fill = Emp_status), position = "dodge") +
  ggtitle("marital status vs employment status")

The graph below shows gender and car loan. For both males and females, considerably more people do not have a car loan than have one.

# after_stat(count) replaces the deprecated ..count.. notation
ggplot(bank, aes(x = Gender, y = after_stat(count))) +
  geom_bar(aes(fill = Car_loan), position = "dodge") +
  ggtitle("gender vs car loan")

The graph below shows gender and education loan. For both males and females, considerably more people do not have an education loan than have one.

# after_stat(count) replaces the deprecated ..count.. notation
ggplot(bank, aes(x = Gender, y = after_stat(count))) +
  geom_bar(aes(fill = Education_loan), position = "dodge") +
  ggtitle("Gender vs education Loan")

The graph below shows gender and personal loan. For both males and females, the numbers with and without a personal loan are relatively similar.

# after_stat(count) replaces the deprecated ..count.. notation
ggplot(bank, aes(x = Gender, y = after_stat(count))) +
  geom_bar(aes(fill = Personal_loan), position = "dodge") +
  ggtitle("Gender vs personal Loan")

The graph below shows gender and home loan. For both males and females, considerably more people do not have a home loan than have one.

# after_stat(count) replaces the deprecated ..count.. notation
ggplot(bank, aes(x = Gender, y = after_stat(count))) +
  geom_bar(aes(fill = Home_loan), position = "dodge") +
  ggtitle("Gender vs home Loan")

---
title: "Bank Loan Default Dataset"
author: "Eric Zhu"
date: "2025-02-05"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: no
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---


```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. It is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight: bold;
  color: navy;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 18px;
  font-family: system-ui;
  font-weight: bold;
  color: navy;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
  font-weight: bold;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 16px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 14px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if (!require("readxl")) { # reading Excel files
   install.packages("readxl")
   library(readxl)
}
if (!require("ggplot2")) { # plotting
   install.packages("ggplot2")
   library(ggplot2)
}
if (!require("ISLR")) { # contains example data set "Khan"
   install.packages("ISLR")
   library(ISLR)
}
if (!require("RColorBrewer")) { # customized coloring of plots
   install.packages("RColorBrewer")
   library(RColorBrewer)
}
knitr::opts_chunk$set(echo = TRUE,       # include code chunk in the output file
                      warning = FALSE,   # sometimes, your code may produce warning messages;
                                         # you can choose to include the warning messages in
                                         # the output file.
                      results = TRUE,    # you can also decide whether to include the output
                                         # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
```




# Read in Bank Loan Default Dataset

Read in the dataset.

```{r}
bank <- read.csv("BankLoanDefaultDataset.csv")
```

The output below shows the structure of the dataset, including the type of each variable.

```{r}
str(bank)
```

The first few rows of the dataset are shown below.

```{r}
head(bank)
```


## Create Missing Values

The original dataset does not have any missing values, so I created them manually by setting 20 randomly chosen rows to NA in each of eight variables.

```{r}
gender.missing.id <- sample(1:1000, 20, replace = FALSE)
Marital.missing.id <- sample(1:1000, 20, replace = FALSE)
emp.status.missing.id <- sample(1:1000, 20, replace = FALSE)
credit.missing.id <- sample(1:1000, 20, replace = FALSE)
amount.missing.id <- sample(1:1000, 20, replace = FALSE)
emp.duration.missing.id <- sample(1:1000, 20, replace = FALSE)
check.amt.missing.id <- sample(1:1000, 20, replace = FALSE)
age.missing.id <- sample(1:1000, 20, replace = FALSE)

bank$Gender[gender.missing.id] <- NA
bank$Marital_status[Marital.missing.id] <- NA
bank$Emp_status[emp.status.missing.id] <- NA
bank$Credit_score[credit.missing.id] <- NA
bank$Amount[amount.missing.id] <- NA
bank$Emp_duration[emp.duration.missing.id] <- NA
bank$Checking_amount[check.amt.missing.id] <- NA
bank$Age[age.missing.id] <- NA



```
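
Because `sample()` draws different rows on every run, the rows set to NA (and therefore the plots that follow) change each time the report is knitted. The following is a minimal sketch of a reproducible version. The data frame `toy`, the seed value, and the variable names are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical toy data standing in for the bank dataset
toy <- data.frame(
  Gender = rep(c("Female", "Male"), each = 50),
  Age    = rep(21:70, times = 2),
  stringsAsFactors = FALSE
)

set.seed(123)  # arbitrary seed (assumption) so the same rows are drawn on every run

n_missing <- 20
# Draw row indices from the actual number of rows rather than a hard-coded 1000
gender_missing_id <- sample(nrow(toy), n_missing, replace = FALSE)
age_missing_id    <- sample(nrow(toy), n_missing, replace = FALSE)

toy$Gender[gender_missing_id] <- NA
toy$Age[age_missing_id]       <- NA

# Count missing values per column to confirm the result
colSums(is.na(toy))
```

Using `nrow()` instead of a literal row count keeps the code valid if the dataset size changes.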

# Description and Purpose of Dataset and Analytic Tasks

This dataset is titled "Bank Loan Default Dataset" and contains several explanatory variables and one target (response) variable, Default. The purpose of collecting this dataset is to see which factors may cause people's loans to go into default. The dataset was obtained from the Course Project Data Repository. It contains 1000 observations and 16 variables (15 feature variables and 1 target variable).

Checking_amount (numerical) is the amount of money in one's checking account, and Saving_amount (numerical) is the amount of money in one's savings account. Term (numerical) is the duration of the loan term. Credit_score (numerical) is one's credit score. Gender (categorical) is male or female. Marital_status (categorical) is married or single. Car_loan, Personal_loan, Home_loan, and Education_loan (all categorical/binary) show whether a person has a loan of that type. Emp_status (categorical) is employed or unemployed. Amount (numerical) is the amount of the loan. Emp_duration (numerical) is the length of employment in months. Age (numerical) is the person's age, and No_of_credit_acc (numerical) is the number of credit accounts.

The overall goal of this project is to see how significant the feature variables are in predicting whether one's loan will be in default (1 if default, 0 if not). The first part of the project is exploratory data analysis (EDA).

The original dataset did not have any missing values, so I created them manually. In this dataset, the variables Gender, Marital_status, Emp_status, Credit_score, Amount, Emp_duration, Checking_amount, and Age now have missing values. Missing numerical values can be resolved by imputing the mean, and missing categorical values by imputing the mode.
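
Mean and mode imputation can be sketched as follows. This is a minimal illustration on a made-up data frame `toy`, and `get_mode()` is a hypothetical helper (base R has no built-in function for the statistical mode), so none of these names come from the original analysis.

```r
# Hypothetical toy data with missing values (not the bank dataset)
toy <- data.frame(
  Age    = c(30, NA, 40, 50, NA),
  Gender = c("Male", "Female", NA, "Male", "Male"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent non-missing value
get_mode <- function(x) {
  counts <- table(x)  # table() drops NA by default
  names(counts)[which.max(counts)]
}

# Mean imputation for a numerical variable
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)

# Mode imputation for a categorical variable
toy$Gender[is.na(toy$Gender)] <- get_mode(toy$Gender)

toy
```

Note that the mean of an integer column such as Age is generally not an integer, so the imputed value may need rounding, and that mean imputation reduces the variance of the imputed variable.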

# Distribution of Individual Features

This section shows the distribution of each individual feature. Some features have missing values, which can be resolved by imputing the mean or mode.

The figure below shows the distribution of the Gender variable. There are considerably more males than females.

```{r}
ggplot(bank, aes(x = Gender)) + 
  
  geom_bar() +
  labs(title = "Gender")
```

The figure below shows the distribution of the Marital_status variable. The counts of married and single people are similar.

```{r}
ggplot(bank, aes(x = Marital_status)) + 
  
  geom_bar() +
  labs(title = "Marital_status")
```

The figure below shows the distribution of the Emp_status variable. There are considerably more unemployed people than employed people.

```{r}
ggplot(bank, aes(x = Emp_status)) + 
  
  geom_bar() +
  labs(title = "Emp_status")
```
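
The counts behind a bar chart like the ones above can also be tabulated directly, which makes the missing values explicit (by default, ggplot2 draws them as a separate NA bar). The following is a minimal sketch on a made-up vector `emp_status`, an assumption for illustration rather than the actual column.

```r
# Hypothetical toy vector standing in for a categorical column such as Emp_status
emp_status <- c("employed", "unemployed", "unemployed", NA, "employed", "unemployed")

# Counts per category, keeping NA as its own category
counts <- table(emp_status, useNA = "ifany")
counts

# Proportions among the non-missing values only
props <- prop.table(table(emp_status))
round(props, 2)
```

Reporting the proportions alongside each chart would make statements such as "considerably more" concrete.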


The figure below shows the distribution of the Car_loan variable. Considerably more people do not have a car loan than have one.

```{r}
ggplot(bank, aes(x = Car_loan)) + 
  
  geom_bar() +
  labs(title = "Car_loan")
```

The figure below shows the distribution of the Personal_loan variable. The counts of people with and without a personal loan are similar.

```{r}
ggplot(bank, aes(x = Personal_loan)) + 
  
  geom_bar() +
  labs(title = "Personal_loan")
```

The figure below shows the distribution of the Education_loan variable. Considerably more people do not have an education loan than have one.

```{r}
ggplot(bank, aes(x = Education_loan)) + 
  
  geom_bar() +
  labs(title = "Education_loan")
```

The figure below shows the distribution of the Home_loan variable. Considerably more people do not have a home loan than have one.

```{r}
ggplot(bank, aes(x = Home_loan)) + 
  
  geom_bar() +
  labs(title = "Home_loan")
```

The figure below shows the distribution of the Credit_score variable. There are no alarming trends apart from a couple of anomalies.

```{r}
ggplot(data = bank, aes(x = Credit_score)) + 
  geom_boxplot() + 
  
  labs(title = "Credit_score")
```
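
The anomalies in a boxplot are the points drawn beyond the whiskers, that is, more than 1.5 times the interquartile range away from the box. They can be listed with base R's `boxplot.stats()`. The following is a minimal sketch on made-up scores (the vector `credit_score` is an assumption for illustration).

```r
# Hypothetical toy credit scores (not taken from the bank dataset)
credit_score <- c(700, 720, 740, 750, 760, 780, 800, 820, 400, NA)

# boxplot.stats() ignores NA and returns, in $out, the points lying more than
# 1.5 * IQR beyond the hinges
outliers <- boxplot.stats(credit_score)$out
outliers
```

ggplot2 computes the quartiles slightly differently from `boxplot.stats()`, so borderline points may be classified differently in the two approaches.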

The figure below shows the distribution of the Checking_amount variable. There are no alarming trends.

```{r}
ggplot(data = bank, aes(x = Checking_amount)) + 
  geom_boxplot() + 
  
  labs(title = "Checking_amount")
```

The figure below shows the distribution of the Term variable. There are no alarming trends.

```{r}
ggplot(data = bank, aes(x = Term)) + 
  geom_boxplot() + 
  
  labs(title = "Term")

```

The figure below shows the distribution of the Amount variable. There are no alarming trends.

```{r}
ggplot(data = bank, aes(x = Amount)) + 
  geom_boxplot() + 
  
  labs(title = "Amount")
```

The figure below shows the distribution of the Saving_amount variable. There are no alarming trends.

```{r}
ggplot(data = bank, aes(x = Saving_amount)) + 
  geom_boxplot() + 
  
  labs(title = "Saving amount")
```

The figure below shows the distribution of the Emp_duration variable. There are no alarming trends.

```{r}
ggplot(data = bank, aes(x = Emp_duration)) + 
  geom_boxplot() + 
  
  labs(title = "Emp duration")
```

The figure below shows the distribution of the Age variable. There are no alarming trends.

```{r}
ggplot(data = bank, aes(x = Age)) + 
  geom_boxplot() + 
  
  labs(title = "Age")
```

The figure below shows the distribution of the No_of_credit_acc variable. This variable appears to be heavily skewed.

```{r}
ggplot(data = bank, aes(x = No_of_credit_acc)) + 
  geom_boxplot() + 
  
  labs(title = "No_of_credit_acc")
```
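
Skewness in a count variable can be checked with a frequency table and a moment-based skewness coefficient. The following is a minimal sketch on made-up counts. The vector `n_acc` is an assumption for illustration, and the formula is the standard third standardized moment computed by hand rather than a function from a package.

```r
# Hypothetical toy counts standing in for No_of_credit_acc
n_acc <- c(1, 1, 1, 1, 1, 1, 2, 2, 3, 9)

# Frequency of each distinct value
table(n_acc)

# Moment-based skewness: mean cubed deviation divided by the cubed
# (population) standard deviation
m <- mean(n_acc)
skewness <- mean((n_acc - m)^3) / mean((n_acc - m)^2)^1.5
skewness  # positive values indicate a long right tail
```

For a variable that takes only a few distinct integer values, the frequency table (or a bar chart of it) is usually more informative than a boxplot.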

# One Categorical Feature and One Numerical feature Graphs

In this section, we will show the relationship between one categorical feature and one numerical feature.

The figure below shows the relationship between credit score and gender. Based on the graph, the credit score ranges look similar across both genders. There are some anomalies, but I do not believe that they will have a significant effect on the analysis. The missing values can be resolved using imputation.

```{r}
ggplot(bank, aes(x=Credit_score, y=Gender, fill=Gender)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Credit Score by Gender")
```

The figure below shows the relationship between loan amount and gender. Based on the graph, the loan amount ranges look similar across both genders. There are some anomalies, but I do not believe that they will have a significant effect on the analysis. The missing values can be resolved using imputation.

```{r}
ggplot(bank, aes(x=Amount, y=Gender, fill=Gender)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Amount by Gender")
```

The figure below shows the relationship between employment duration and gender. Based on the graph, males appear to have a longer employment duration than females. The missing values can be resolved using imputation.

```{r}
ggplot(bank, aes(x=Emp_duration, y=Gender, fill=Gender)) +
  geom_boxplot() + theme(legend.position="none")+
ggtitle("Employment duration by Gender")
```
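
A group comparison like this one can be backed up with a per-group summary. The sketch below computes the median employment duration by gender with `aggregate()`. The data frame `toy` is made up for illustration, so the numbers say nothing about the real dataset.

```r
# Made-up rows using the dataset's column names
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Emp_duration = c(4, 12, 20, 25, 43, 60)
)

# Median employment duration per gender; rows with NA are dropped by default
group_medians <- aggregate(Emp_duration ~ Gender, data = toy, FUN = median)
group_medians
```

The same call with `data = bank` would give the medians behind the boxplot, and swapping the grouping column (for example `Emp_duration ~ Marital_status`) covers the other comparisons in this section.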

The figure below shows the relationship between age and gender. The age ranges look similar for both genders. There are some outliers, but I do not expect them to have a significant effect on the analysis. There are missing values, but they can be handled with imputation.

```{r}
ggplot(bank, aes(x = Age, y = Gender, fill = Gender)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Age by Gender")
```

The figure below shows the relationship between credit score and marital status. The credit score ranges look similar for single and married people. There are some outliers, but I do not expect them to have a significant effect on the analysis. There are missing values, but they can be handled with imputation.

```{r}
ggplot(bank, aes(x = Credit_score, y = Marital_status, fill = Marital_status)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Credit Score by Marital status")
```

The figure below shows the relationship between loan amount and marital status. The loan amount ranges look similar for both marital statuses. There are some outliers, but I do not expect them to have a significant effect on the analysis. There are missing values, but they can be handled with imputation.

```{r}
ggplot(bank, aes(x = Amount, y = Marital_status, fill = Marital_status)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Amount by Marital status")
```

The figure below shows the relationship between employment duration and marital status. Married people appear to have longer employment durations than single people. There are missing values, but they can be handled with imputation.

```{r}
ggplot(bank, aes(x = Emp_duration, y = Marital_status, fill = Marital_status)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Employment duration by Marital status")
```

The figure below shows the relationship between age and marital status. The age ranges look similar for both marital statuses. There are some outliers, but I do not expect them to have a significant effect on the analysis. There are missing values, but they can be handled with imputation.

```{r}
ggplot(bank, aes(x = Age, y = Marital_status, fill = Marital_status)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Age by Marital status")
```

The figure below shows the relationship between credit score and employment status. The credit score ranges look similar for employed and unemployed people. There seem to be more outliers among unemployed people, but I do not expect them to have a significant effect on the analysis. There are missing values, but they can be handled with imputation.

```{r}
ggplot(bank, aes(x = Credit_score, y = Emp_status, fill = Emp_status)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Credit Score by Employment status")
```

The figure below shows the relationship between loan amount and employment status. The loan amount ranges look similar for employed and unemployed people. Again, there seem to be more outliers among unemployed people than among employed people. There are missing values, but they can be handled with imputation.

```{r}
ggplot(bank, aes(x = Amount, y = Emp_status, fill = Emp_status)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Amount by Employment status")
```

The figure below shows the relationship between employment duration and employment status. Unemployed people appear to have longer and more varied employment durations than employed people. There are missing values, but they can be handled with imputation.

```{r}
ggplot(bank, aes(x = Emp_duration, y = Emp_status, fill = Emp_status)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Employment duration by employment status")
```

The figure below shows the relationship between age and employment status. The age ranges look similar for both statuses. There are some outliers, but I do not expect them to have a significant effect on the analysis. There are missing values, but they can be handled with imputation.

```{r}
ggplot(bank, aes(x = Age, y = Emp_status, fill = Emp_status)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Age by employment status")
```

The figure below shows the relationship between age and car loan. The age ranges look similar for those who have a car loan and those who do not.

```{r}
# Convert the car, personal, home, and education loan indicators to factors for EDA purposes
bank$Car_loan <- as.factor(bank$Car_loan)
bank$Personal_loan <- as.factor(bank$Personal_loan)
bank$Home_loan <- as.factor(bank$Home_loan)
bank$Education_loan <- as.factor(bank$Education_loan)

ggplot(bank, aes(x = Age, y = Car_loan, fill = Car_loan)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Age by car loan")
```

The figure below shows the relationship between age and personal loan. The age ranges look similar for those who have a personal loan and those who do not.

```{r}
ggplot(bank, aes(x = Age, y = Personal_loan, fill = Personal_loan)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Age by personal loan")
```

The figure below shows the relationship between age and home loan. The age ranges look similar for those who have a home loan and those who do not.

```{r}
ggplot(bank, aes(x = Age, y = Home_loan, fill = Home_loan)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Age by home loan")
```

The figure below shows the relationship between age and education loan. People with an education loan appear to be younger than those without one.

```{r}
ggplot(bank, aes(x = Age, y = Education_loan, fill = Education_loan)) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Age by education loan")
```

# Two numerical feature variables graphs

Next, we will examine the relationship between two numerical features.

The graph below plots employment duration against credit score. Regardless of employment duration, most credit scores fall in the 700-900 range.

```{r}
ggplot(data = bank, aes(x = Credit_score, y = Emp_duration)) +
  geom_point() + 
  ggtitle("Employment Duration vs Credit Score")
```
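
As a rough numeric check on the 700-900 observation, the share of scores inside that range can be computed directly. The sketch below uses only the first ten Credit_score values printed by `str(bank)` earlier, so the result describes those ten rows and not the full dataset. The names `credit_head` and `share_in_range` are introduced here for illustration.

```r
# First ten Credit_score values as printed by str(bank)
credit_head <- c(796, 813, 756, 737, 662, 828, 856, 763, 778, 649)

# Proportion of scores between 700 and 900 (inclusive)
share_in_range <- mean(credit_head >= 700 & credit_head <= 900)
share_in_range  # 0.8 for these ten rows
```

For the full data, the same expression with `bank$Credit_score` in place of `credit_head` (adding `na.rm = TRUE` to `mean()` if the column has missing values) gives the overall share.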

The graph below plots employment duration against loan amount. There do not seem to be any patterns in this graph.

```{r}
ggplot(data = bank, aes(x = Amount, y = Emp_duration)) +
  geom_point() + 
  ggtitle("Employment Duration vs Loan Amount")
```

The graph below plots employment duration against age. There do not seem to be any patterns in this graph.

```{r}
ggplot(data = bank, aes(x = Age, y = Emp_duration)) +
  geom_point() + 
  ggtitle("Employment Duration vs Age")
```

The graph below plots credit score against age. Again, regardless of age, most credit scores fall in the 700-900 range.

```{r}
ggplot(data = bank, aes(x = Age, y = Credit_score)) +
  geom_point() +
  ggtitle("Age vs Credit Score")
```

The graph below plots saving amount against checking amount. There does not seem to be any correlation between these two variables.

```{r}
ggplot(data = bank, aes(x = Checking_amount, y = Saving_amount)) +
  geom_point() +
  ggtitle("Checking Amount vs Saving Amount")
```
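
A visual impression of "no correlation" can be checked with `cor()`. The sketch below uses only the first six rows of the two columns as printed by `str(bank)`, so the coefficients are far too noisy to say anything about the full dataset; it only shows the call. The names `checking_head` and `saving_head` are introduced here for illustration.

```r
# First six values of each column, as printed by str(bank)
checking_head <- c(988, 458, 158, 300, 63, 1071)
saving_head   <- c(3455, 3600, 3093, 2449, 2867, 3282)

# Pearson measures linear association; Spearman measures monotonic (rank) association
r_pearson  <- cor(checking_head, saving_head, method = "pearson")
r_spearman <- cor(checking_head, saving_head, method = "spearman")
c(pearson = r_pearson, spearman = r_spearman)
```

On the full data, `cor(bank$Checking_amount, bank$Saving_amount, use = "complete.obs")` would drop rows with missing values before computing the coefficient.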

The graph below plots loan amount against age. Regardless of age, most loan amounts fall within the same range.

```{r}
ggplot(data = bank, aes(x = Age, y = Amount)) +
  geom_point() +
  ggtitle("Age vs Loan Amount")
```

The graph below plots loan amount against credit score. Most of the points are clustered together.

```{r}
ggplot(data = bank, aes(x = Credit_score, y = Amount)) +
  geom_point() +
  ggtitle("Credit Score vs Loan Amount")
```


# Two categorical feature variables graphs

This section shows the relationships between two categorical features.

The graph below shows gender and employment status. The numbers of employed and unemployed females are similar, but there are significantly more unemployed males than employed males.

```{r}
# geom_bar() counts rows by default, so the deprecated ..count.. notation is not needed
ggplot(bank, aes(x = Gender, fill = Emp_status)) +
  geom_bar(position = "dodge") +
  ggtitle("Gender vs employment status")
```

The graph below shows gender and marital status. It appears that all females are single, and there are significantly more married males than single males.

```{r}
ggplot(bank, aes(x = Gender, fill = Marital_status)) +
  geom_bar(position = "dodge") +
  ggtitle("Gender vs marital status")
```
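
Claims such as "all females are single" are easy to confirm with a contingency table. The sketch below cross-tabulates two categorical columns with `table()` and converts the counts to row shares with `prop.table()`. The data frame `toy` is made up for illustration, so the counts do not reflect the real dataset.

```r
# Made-up rows using the dataset's column names
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  Marital_status = c("Single", "Single", "Single",
                     "Married", "Married", "Married", "Single")
)

counts <- table(toy$Gender, toy$Marital_status)  # rows: gender, columns: marital status
counts

row_share <- prop.table(counts, margin = 1)      # share of each marital status within a gender
row_share
```

On the real data, `table(bank$Gender, bank$Marital_status)` gives the exact counts behind the bar chart; a zero in the Female/Married cell would confirm the observation above.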

The graph below shows marital status and employment status. Among single people, the numbers of employed and unemployed are relatively similar, but among married people significantly more are unemployed.

```{r}
ggplot(bank, aes(x = Marital_status, fill = Emp_status)) +
  geom_bar(position = "dodge") +
  ggtitle("Marital status vs employment status")
```

The graph below shows gender and car loan. For both males and females, significantly more people do not have a car loan than have one.

```{r}
ggplot(bank, aes(x = Gender, fill = Car_loan)) +
  geom_bar(position = "dodge") +
  ggtitle("Gender vs car loan")
```

The graph below shows gender and education loan. For both males and females, significantly more people do not have an education loan than have one.

```{r}
ggplot(bank, aes(x = Gender, fill = Education_loan)) +
  geom_bar(position = "dodge") +
  ggtitle("Gender vs education loan")
```

The graph below shows gender and personal loan. For both males and females, the numbers of people with and without a personal loan are relatively similar.

```{r}
ggplot(bank, aes(x = Gender, fill = Personal_loan)) +
  geom_bar(position = "dodge") +
  ggtitle("Gender vs personal loan")
```

The graph below shows gender and home loan. For both males and females, significantly more people do not have a home loan than have one.

```{r}
ggplot(bank, aes(x = Gender, fill = Home_loan)) +
  geom_bar(position = "dodge") +
  ggtitle("Gender vs home loan")
```
