1 Introduction

In this assignment, we will look at a data set which observed bank loan data. We will conduct exploratory data analysis steps in order to analyze and prepare this data for further analysis. We will address concerns such as potential missing values, along with creating and redefining some of the variables in the original data set to make them more meaningful and relevant for future analysis.

1.1 Data Description

This data set was collected from the U.S. Small Business Administration (SBA) over the span of the years from 1987 through 2014. This data set contains 899,164 observations of 27 different variables. Each of these observations represent a loan which was guaranteed by the SBA.

2 Exploratory Data Analysis

First, we will read in the combined bank loan data sets.

loan01 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
var.names = names(loan)
my.var = var.names[c(1,6,21,26)]
my.new.data = loan[1:15, my.var]
dim(my.new.data)
[1] 15  4

2.1 Addressing the Missing Values

This data set contains some missing values and so we must address these before continuing any further with our analysis of this data. For this project, we will delete all of the observations which had a missing value of MIS_Status.

colSums(is.na(loan))
    LoanNr_ChkDgt              Name              City             State 
                0                 5                 0                 0 
              Zip              Bank         BankState             NAICS 
                0                 0                 0                 0 
     ApprovalDate        ApprovalFY              Term             NoEmp 
                0                 0                 0                 0 
         NewExist         CreateJob       RetainedJob     FranchiseCode 
              136                 0                 0                 0 
       UrbanRural         RevLineCr            LowDoc        ChgOffDate 
                0                 0                 0                 0 
 DisbursementDate DisbursementGross      BalanceGross        MIS_Status 
                0                 0                 0                 0 
     ChgOffPrinGr            GrAppv          SBA_Appv 
                0                 0                 0 

It turns out that the MIS_Status variable has zero missing observations, so we do not have to delete any observations based upon a missing value of MIS_Status.

In fact, the only variable with any missing observations is the NewExist variable. We will go ahead and delete these missing observations instead since the MIS_Status variable was all good. There are 136 missing observations of this NewExist variable and we will delete these missing observations.

We will create a new data set called “loan.noMissing” to represent that it has no missing observations.

loan.noMissing <- na.omit(loan)

Let’s double check that we successfully removed all observations with missing values from the data set.

colSums(is.na(loan.noMissing))
    LoanNr_ChkDgt              Name              City             State 
                0                 0                 0                 0 
              Zip              Bank         BankState             NAICS 
                0                 0                 0                 0 
     ApprovalDate        ApprovalFY              Term             NoEmp 
                0                 0                 0                 0 
         NewExist         CreateJob       RetainedJob     FranchiseCode 
                0                 0                 0                 0 
       UrbanRural         RevLineCr            LowDoc        ChgOffDate 
                0                 0                 0                 0 
 DisbursementDate DisbursementGross      BalanceGross        MIS_Status 
                0                 0                 0                 0 
     ChgOffPrinGr            GrAppv          SBA_Appv 
                0                 0                 0 

As we can see, no variables have any more missing observations, so we have successfully dealt with the missing value concern for our data set.

2.2 Variable Formatting

We will now change all of the currency related variables in this combined bank loan data set to be regular, numeric variables. These currency related variables include DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv. To do this, we will remove the dollar sign and the commas from the observations of these currency related variables.

currency <- c("DisbursementGross", "BalanceGross", "ChgOffPrinGr", "GrAppv", "SBA_Appv")

loan.updated <- loan.noMissing %>%
  mutate(across(all_of(currency), ~ as.numeric(gsub("[$,]", "", .))))

Now, all of the currency related variables have been updated to be represented as standard, numeric variables without commas or a dollar sign. The new, updated variable data has been stored in a new data set called “loan.updated”.

We can view a summary of these updated, numeric variables on currency to ensure that these changes to remove the commas and dollar signs properly worked.

summary(loan.updated[currency])
 DisbursementGross   BalanceGross     ChgOffPrinGr         GrAppv       
 Min.   :       0   Min.   :     0   Min.   :      0   Min.   :    200  
 1st Qu.:   42000   1st Qu.:     0   1st Qu.:      0   1st Qu.:  35000  
 Median :  100000   Median :     0   Median :      0   Median :  90000  
 Mean   :  201142   Mean   :     3   Mean   :  13505   Mean   : 192676  
 3rd Qu.:  238000   3rd Qu.:     0   3rd Qu.:      0   3rd Qu.: 225000  
 Max.   :11446325   Max.   :996262   Max.   :3512596   Max.   :5472000  
    SBA_Appv      
 Min.   :    100  
 1st Qu.:  21250  
 Median :  61285  
 Mean   : 149480  
 3rd Qu.: 175000  
 Max.   :5472000  

As we can see, the values of all of these currency variables are now given as standard, numeric variables.

2.3 Combining Sparse Categories

Now, we will take a look at one of the categorical variables from this combined bank loan data set and combine its sparse categories in a way that is meaningful. This will allow us to have new categories that are fewer in number, and provide better meaningfulness for the interpretation of this variable.

We will look at the categorical variable “NAICS”. This variable looks at the North American Industry Classification System code of each observation. This variable defines each observation by the specific industry of which it is relevant to.

knitr::include_graphics("NAICS.png")

We will combined the observations of the NAICS variable into categories based upon their official classification.

loan.updated<- loan.updated %>%
  mutate(NAICS.combined = case_when(
    NAICS %in% c("11") ~ "Agriculture",
    NAICS %in% c("21") ~ "Mining",
    NAICS %in% c("22") ~ "Utilities",
    NAICS %in% c("23") ~ "Construction",
    NAICS %in% c("31", "32", "33") ~ "Manufacturing",
    NAICS %in% c("42") ~ "Wholesale Trade",
    NAICS %in% c("48", "49") ~ "Transportation",
    NAICS %in% c("51") ~ "Information",
    NAICS %in% c("52") ~ "Finance",
    NAICS %in% c("53") ~ "Real Estate",
    NAICS %in% c("54") ~ "Professional Services",
    NAICS %in% c("55") ~ "Management",
    NAICS %in% c("56") ~ "Administrative Support",
    NAICS %in% c("61") ~ "Educational Services",
    NAICS %in% c("62") ~ "Healthcare",
    NAICS %in% c("71") ~ "Arts",
    NAICS %in% c("72") ~ "Food Services",
    NAICS %in% c("81") ~ "Other Services",
    NAICS %in% c("92") ~ "Public Administration"
  )
)

2.4 Calculating the Default Rates

Next, we will calculate the default rates of the NAICS.combined categorical variable which we created in the previous part.

For this step, the value CHGOFF in MIS_Status is defined to be loan default. The default rate will be the percentage of CHGOFF in MIS_Status.

default.rates <- loan.updated %>%
  group_by(NAICS.combined) %>%
  summarise(total = n(),
            defaults = sum(MIS_Status == "CHGOFF", na.rm = TRUE),
            default.rates = defaults / total)

We can view the values of the default rates which we calculated.

default.rates
# A tibble: 1 × 4
  NAICS.combined  total defaults default.rates
  <chr>           <int>    <int>         <dbl>
1 <NA>           899023   157543         0.175

Now, we have successfully calculated the default rates for our NAICS.combined variable which we created to group together the sparse categories.

2.5 Discretizing the GrAppv Variable

We will discretize the variable of GrAppv into five categories. This variable represents the gross amount of the loan that has been approved by the bank.

We will create these categories based upon the overall level of he value of the GrAppv value. These subcategories will split the GrAppv variable into five levels, “lowest”, “low”, “medium”, “high”, and “highest” based upon the value of the gross amount of loan that has been approved by the ban.

First, let’s look at a quick summary of the original GrAppv variable.

summary(loan.updated$GrAppv)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    200   35000   90000  192676  225000 5472000 

As we can see, the minimum value of the variable is $200, and the maximum value is $5,472,000. The median value is $90,000 indicating that the maximum value differs much more greatly from the median value than the minimum value does.

We will split the GrAppv variable into five subcategories based upon the value of each observation. We will create a new variable called “GrAppv.dis” to represent the discretized version of the original GrAppv variable.

loan.updated$GrAppv.dis <- cut(loan.updated$GrAppv,
                                   breaks = 5,
                                   labels = c("Lowest", "Low", "Medium", "High", "Highest"))

Now we have five subcategories of the GrAppv variable split up from the lowest to the highest.

2.6 Density Curves

Lastly, we will draw the density curves of SBA_Appv for each of the five sub-populations that were created in the previous part. We will place these five density curves onto the same plot.

ggplot(loan.updated, aes(x = SBA_Appv, fill = GrAppv.dis)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Curves of SBA_Appv for the \n Five SubCategories of GrAppv",
       x = "SBA_Appv", y = "Density") 

In the graph, we can see the density curves of the SBA_Appv variable for each of the five categories of GrAppv that we created in the previous part. We can see that these five categories have been color-coded by their respective rank of “lowest”, “low”, “medium”, “high”, “or highest”.

3 Conclusion

We have completed the desired exploratory data analysis steps including dealing with missing values as well as updating some of the variables within the original combined bank loan data set. We now have a new data set called “loan.updated” with all of the desired updates to the data set.

---
title: "EDA with the Combined Bank Loan Data Set"
author: "Josie Gallop"
date: "2025-02-24"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
editor_options: 
  chunk_output_type: console
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
if(!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}
if(!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if(!require("usdm")) {
   install.packages("usdm")
   library(usdm)
}
if(!require("car")) {
   install.packages("car")
   library(car)
}
if (!require("boot")) {
   install.packages("boot")
   library(boot)
}
if(!require("pander")) {
   install.packages("pander")
   library(pander)
}
if(!require("mice")) {
   install.packages("mice")
   library(mice)
}
if(!require("mlbench")) {
   install.packages("mlbench")
   library(mlbench)
}
if(!require("psych")) {
   install.packages("psych")
   library(psych)
}
if(!require("broom.mixed")) {
   install.packages("broom.mixed")
   library(broom.mixed)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if(!require("caret")) {
   install.packages("caret")
   library(caret)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
# Specifications of outputs of code in code chunks
knitr::opts_chunk$set(echo = TRUE,  # include code chunk in the output                                           file
                   warning = FALSE,  # Sometimes, your code may produce a                                         warning
                                     # messages, you can choose to include                                        the
                                     # warning messages in the output file. 
                   message = FALSE,  
                   results = TRUE,   # you can also decide whether to                                             include 
                                     # the output in the output file.
                   comment = NA      # Suppress hash-tags in the output                                           results.
                      )   
```


# Introduction

In this assignment, we will look at a data set which observed bank loan data. We will conduct exploratory data analysis steps in order to analyze and prepare this data for further analysis. We will address concerns such as potential missing values, along with creating and redefining some of the variables in the original data set to make them more meaningful and relevant for future analysis. 


## Data Description

This data set was collected from the U.S. Small Business Administration (SBA) over the span of the years from 1987 through 2014. This data set contains 899,164 observations of 27 different variables. Each of these observations represent a loan which was guaranteed by the SBA. 



# Exploratory Data Analysis


First, we will read in the combined bank loan data sets.

```{r}
loan01 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
```

```{r}
var.names = names(loan)
my.var = var.names[c(1,6,21,26)]
my.new.data = loan[1:15, my.var]
dim(my.new.data)
```



## Addressing the Missing Values

This data set contains some missing values and so we must address these before continuing any further with our analysis of this data. For this project, we will delete all of the observations which had a missing value of MIS_Status. 

```{r}
colSums(is.na(loan))
```

It turns out that the MIS_Status variable has zero missing observations, so we do not have to delete any observations based upon a missing value of MIS_Status. 

In fact, the only variable with any missing observations is the NewExist variable. We will go ahead and delete these missing observations instead since the MIS_Status variable was all good. There are 136 missing observations of this NewExist variable and we will delete these missing observations. 

We will create a new data set called "loan.noMissing" to represent that it has no missing observations.

```{r}
loan.noMissing <- na.omit(loan)
```

Let's double check that we successfully removed all observations with missing values from the data set.

```{r}
colSums(is.na(loan.noMissing))
```

As we can see, no variables have any more missing observations, so we have successfully dealt with the missing value concern for our data set. 



## Variable Formatting

We will now change all of the currency related variables in this combined bank loan data set to be regular, numeric variables. These currency related variables include DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv. To do this, we will remove the dollar sign and the commas from the observations of these currency related variables. 

```{r}
currency <- c("DisbursementGross", "BalanceGross", "ChgOffPrinGr", "GrAppv", "SBA_Appv")

loan.updated <- loan.noMissing %>%
  mutate(across(all_of(currency), ~ as.numeric(gsub("[$,]", "", .))))
```

Now, all of the currency related variables have been updated to be represented as standard, numeric variables without commas or a dollar sign. The new, updated variable data has been stored in a new data set called "loan.updated". 

We can view a summary of these updated, numeric variables on currency to ensure that these changes to remove the commas and dollar signs properly worked. 

```{r}
summary(loan.updated[currency])
```

As we can see, the values of all of these currency variables are now given as standard, numeric variables. 




## Combining Sparse Categories

Now, we will take a look at one of the categorical variables from this combined bank loan data set and combine its sparse categories in a way that is meaningful. This will allow us to have new categories that are fewer in number, and provide better meaningfulness for the interpretation of this variable.

We will look at the categorical variable "NAICS". This variable looks at the North American Industry Classification System code of each observation. This variable defines each observation by the specific industry of which it is relevant to. 

```{r}
knitr::include_graphics("NAICS.png")
```

We will combined the observations of the NAICS variable into categories based upon their official classification.

```{r}
loan.updated<- loan.updated %>%
  mutate(NAICS.combined = case_when(
    NAICS %in% c("11") ~ "Agriculture",
    NAICS %in% c("21") ~ "Mining",
    NAICS %in% c("22") ~ "Utilities",
    NAICS %in% c("23") ~ "Construction",
    NAICS %in% c("31", "32", "33") ~ "Manufacturing",
    NAICS %in% c("42") ~ "Wholesale Trade",
    NAICS %in% c("48", "49") ~ "Transportation",
    NAICS %in% c("51") ~ "Information",
    NAICS %in% c("52") ~ "Finance",
    NAICS %in% c("53") ~ "Real Estate",
    NAICS %in% c("54") ~ "Professional Services",
    NAICS %in% c("55") ~ "Management",
    NAICS %in% c("56") ~ "Administrative Support",
    NAICS %in% c("61") ~ "Educational Services",
    NAICS %in% c("62") ~ "Healthcare",
    NAICS %in% c("71") ~ "Arts",
    NAICS %in% c("72") ~ "Food Services",
    NAICS %in% c("81") ~ "Other Services",
    NAICS %in% c("92") ~ "Public Administration"
  )
)
```






## Calculating the Default Rates

Next, we will calculate the default rates of the NAICS.combined categorical variable which we created in the previous part. 

For this step, the value CHGOFF in MIS_Status is defined to be loan default. The default rate will be the percentage of CHGOFF in MIS_Status.

```{r}
default.rates <- loan.updated %>%
  group_by(NAICS.combined) %>%
  summarise(total = n(),
            defaults = sum(MIS_Status == "CHGOFF", na.rm = TRUE),
            default.rates = defaults / total)
```

We can view the values of the default rates which we calculated. 

```{r}
default.rates
```

Now, we have successfully calculated the default rates for our NAICS.combined variable which we created to group together the sparse categories. 




## Discretizing the GrAppv Variable

We will discretize  the variable of GrAppv into five categories. This variable represents the gross amount of the loan that has been approved by the bank.

We will create these categories based upon the overall level of he value of the GrAppv value. These subcategories will split the GrAppv variable into five levels, "lowest", "low", "medium", "high", and "highest" based upon the value of the gross amount of loan that has been approved by the ban.

First, let's look at a quick summary of the original GrAppv variable.

```{r}
summary(loan.updated$GrAppv)
```

As we can see, the minimum value of the variable is $200, and the maximum value is $5,472,000. The median value is $90,000 indicating that the maximum value differs much more greatly from the median value than the minimum value does. 

We will split the GrAppv variable into five subcategories based upon the value of each observation. We will create a new variable called "GrAppv.dis" to represent the discretized version of the original GrAppv variable. 

```{r}
loan.updated$GrAppv.dis <- cut(loan.updated$GrAppv,
                                   breaks = 5,
                                   labels = c("Lowest", "Low", "Medium", "High", "Highest"))
```

Now we have five subcategories of the GrAppv variable split up from the lowest to the highest. 




## Density Curves

Lastly, we will draw the density curves of SBA_Appv for each of the five sub-populations that were created in the previous part. We will place these five density curves onto the same plot. 

```{r}
ggplot(loan.updated, aes(x = SBA_Appv, fill = GrAppv.dis)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Curves of SBA_Appv for the \n Five SubCategories of GrAppv",
       x = "SBA_Appv", y = "Density") 
```

In the graph, we can see the density curves of the SBA_Appv variable for each of the five categories of GrAppv that we created in the previous part. We can see that these five categories have been color-coded by their respective rank of "lowest", "low", "medium", "high", "or highest".


# Conclusion

We have completed the desired exploratory data analysis steps including dealing with missing values as well as updating some of the variables within the original combined bank loan data set. We now have a new data set called "loan.updated" with all of the desired updates to the data set. 
