Exploratory Data
Analysis
First, we will read in the combined bank loan data sets.
loan01 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
var.names = names(loan)
my.var = var.names[c(1,6,21,26)]
my.new.data = loan[1:15, my.var]
dim(my.new.data)
[1] 15 4
Addressing the
Missing Values
This data set contains some missing values and so we must address
these before continuing any further with our analysis of this data. For
this project, we will delete all of the observations which had a missing
value of MIS_Status.
colSums(is.na(loan))
LoanNr_ChkDgt Name City State
0 5 0 0
Zip Bank BankState NAICS
0 0 0 0
ApprovalDate ApprovalFY Term NoEmp
0 0 0 0
NewExist CreateJob RetainedJob FranchiseCode
136 0 0 0
UrbanRural RevLineCr LowDoc ChgOffDate
0 0 0 0
DisbursementDate DisbursementGross BalanceGross MIS_Status
0 0 0 0
ChgOffPrinGr GrAppv SBA_Appv
0 0 0
It turns out that the MIS_Status variable has zero missing
observations, so we do not have to delete any observations based upon a
missing value of MIS_Status.
In fact, the only variable with any missing observations is the
NewExist variable. We will go ahead and delete these missing
observations instead since the MIS_Status variable was all good. There
are 136 missing observations of this NewExist variable and we will
delete these missing observations.
We will create a new data set called “loan.noMissing” to represent
that it has no missing observations.
loan.noMissing <- na.omit(loan)
Let’s double check that we successfully removed all observations with
missing values from the data set.
colSums(is.na(loan.noMissing))
LoanNr_ChkDgt Name City State
0 0 0 0
Zip Bank BankState NAICS
0 0 0 0
ApprovalDate ApprovalFY Term NoEmp
0 0 0 0
NewExist CreateJob RetainedJob FranchiseCode
0 0 0 0
UrbanRural RevLineCr LowDoc ChgOffDate
0 0 0 0
DisbursementDate DisbursementGross BalanceGross MIS_Status
0 0 0 0
ChgOffPrinGr GrAppv SBA_Appv
0 0 0
As we can see, no variables have any more missing observations, so we
have successfully dealt with the missing value concern for our data
set.
Variable
Formatting
We will now change all of the currency related variables in this
combined bank loan data set to be regular, numeric variables. These
currency related variables include DisbursementGross, BalanceGross,
ChgOffPrinGr, GrAppv, and SBA_Appv. To do this, we will remove the
dollar sign and the commas from the observations of these currency
related variables.
currency <- c("DisbursementGross", "BalanceGross", "ChgOffPrinGr", "GrAppv", "SBA_Appv")
loan.updated <- loan.noMissing %>%
mutate(across(all_of(currency), ~ as.numeric(gsub("[$,]", "", .))))
Now, all of the currency related variables have been updated to be
represented as standard, numeric variables without commas or a dollar
sign. The new, updated variable data has been stored in a new data set
called “loan.updated”.
We can view a summary of these updated, numeric variables on currency
to ensure that these changes to remove the commas and dollar signs
properly worked.
summary(loan.updated[currency])
DisbursementGross BalanceGross ChgOffPrinGr GrAppv
Min. : 0 Min. : 0 Min. : 0 Min. : 200
1st Qu.: 42000 1st Qu.: 0 1st Qu.: 0 1st Qu.: 35000
Median : 100000 Median : 0 Median : 0 Median : 90000
Mean : 201142 Mean : 3 Mean : 13505 Mean : 192676
3rd Qu.: 238000 3rd Qu.: 0 3rd Qu.: 0 3rd Qu.: 225000
Max. :11446325 Max. :996262 Max. :3512596 Max. :5472000
SBA_Appv
Min. : 100
1st Qu.: 21250
Median : 61285
Mean : 149480
3rd Qu.: 175000
Max. :5472000
As we can see, the values of all of these currency variables are now
given as standard, numeric variables.
Combining Sparse
Categories
Now, we will take a look at one of the categorical variables from
this combined bank loan data set and combine its sparse categories in a
way that is meaningful. This will allow us to have new categories that
are fewer in number, and provide better meaningfulness for the
interpretation of this variable.
We will look at the categorical variable “NAICS”. This variable looks
at the North American Industry Classification System code of each
observation. This variable defines each observation by the specific
industry of which it is relevant to.
knitr::include_graphics("NAICS.png")

We will combined the observations of the NAICS variable into
categories based upon their official classification.
loan.updated<- loan.updated %>%
mutate(NAICS.combined = case_when(
NAICS %in% c("11") ~ "Agriculture",
NAICS %in% c("21") ~ "Mining",
NAICS %in% c("22") ~ "Utilities",
NAICS %in% c("23") ~ "Construction",
NAICS %in% c("31", "32", "33") ~ "Manufacturing",
NAICS %in% c("42") ~ "Wholesale Trade",
NAICS %in% c("48", "49") ~ "Transportation",
NAICS %in% c("51") ~ "Information",
NAICS %in% c("52") ~ "Finance",
NAICS %in% c("53") ~ "Real Estate",
NAICS %in% c("54") ~ "Professional Services",
NAICS %in% c("55") ~ "Management",
NAICS %in% c("56") ~ "Administrative Support",
NAICS %in% c("61") ~ "Educational Services",
NAICS %in% c("62") ~ "Healthcare",
NAICS %in% c("71") ~ "Arts",
NAICS %in% c("72") ~ "Food Services",
NAICS %in% c("81") ~ "Other Services",
NAICS %in% c("92") ~ "Public Administration"
)
)
Calculating the
Default Rates
Next, we will calculate the default rates of the NAICS.combined
categorical variable which we created in the previous part.
For this step, the value CHGOFF in MIS_Status is defined to be loan
default. The default rate will be the percentage of CHGOFF in
MIS_Status.
default.rates <- loan.updated %>%
group_by(NAICS.combined) %>%
summarise(total = n(),
defaults = sum(MIS_Status == "CHGOFF", na.rm = TRUE),
default.rates = defaults / total)
We can view the values of the default rates which we calculated.
default.rates
# A tibble: 1 × 4
NAICS.combined total defaults default.rates
<chr> <int> <int> <dbl>
1 <NA> 899023 157543 0.175
Now, we have successfully calculated the default rates for our
NAICS.combined variable which we created to group together the sparse
categories.
Discretizing the
GrAppv Variable
We will discretize the variable of GrAppv into five categories. This
variable represents the gross amount of the loan that has been approved
by the bank.
We will create these categories based upon the overall level of he
value of the GrAppv value. These subcategories will split the GrAppv
variable into five levels, “lowest”, “low”, “medium”, “high”, and
“highest” based upon the value of the gross amount of loan that has been
approved by the ban.
First, let’s look at a quick summary of the original GrAppv
variable.
summary(loan.updated$GrAppv)
Min. 1st Qu. Median Mean 3rd Qu. Max.
200 35000 90000 192676 225000 5472000
As we can see, the minimum value of the variable is $200, and the
maximum value is $5,472,000. The median value is $90,000 indicating that
the maximum value differs much more greatly from the median value than
the minimum value does.
We will split the GrAppv variable into five subcategories based upon
the value of each observation. We will create a new variable called
“GrAppv.dis” to represent the discretized version of the original GrAppv
variable.
loan.updated$GrAppv.dis <- cut(loan.updated$GrAppv,
breaks = 5,
labels = c("Lowest", "Low", "Medium", "High", "Highest"))
Now we have five subcategories of the GrAppv variable split up from
the lowest to the highest.
Density Curves
Lastly, we will draw the density curves of SBA_Appv for each of the
five sub-populations that were created in the previous part. We will
place these five density curves onto the same plot.
ggplot(loan.updated, aes(x = SBA_Appv, fill = GrAppv.dis)) +
geom_density(alpha = 0.5) +
labs(title = "Density Curves of SBA_Appv for the \n Five SubCategories of GrAppv",
x = "SBA_Appv", y = "Density")

In the graph, we can see the density curves of the SBA_Appv variable
for each of the five categories of GrAppv that we created in the
previous part. We can see that these five categories have been
color-coded by their respective rank of “lowest”, “low”, “medium”,
“high”, “or highest”.
---
title: "EDA with the Combined Bank Loan Data Set"
author: "Josie Gallop"
date: "2025-02-24"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
editor_options: 
  chunk_output_type: console
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
if(!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}
if(!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if(!require("usdm")) {
   install.packages("usdm")
   library(usdm)
}
if(!require("car")) {
   install.packages("car")
   library(car)
}
if (!require("boot")) {
   install.packages("boot")
   library(boot)
}
if(!require("pander")) {
   install.packages("pander")
   library(pander)
}
if(!require("mice")) {
   install.packages("mice")
   library(mice)
}
if(!require("mlbench")) {
   install.packages("mlbench")
   library(mlbench)
}
if(!require("psych")) {
   install.packages("psych")
   library(psych)
}
if(!require("broom.mixed")) {
   install.packages("broom.mixed")
   library(broom.mixed)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if(!require("caret")) {
   install.packages("caret")
   library(caret)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
# Specifications of outputs of code in code chunks
knitr::opts_chunk$set(echo = TRUE,  # include code chunk in the output                                           file
                   warning = FALSE,  # Sometimes, your code may produce a                                         warning
                                     # messages, you can choose to include                                        the
                                     # warning messages in the output file. 
                   message = FALSE,  
                   results = TRUE,   # you can also decide whether to                                             include 
                                     # the output in the output file.
                   comment = NA      # Suppress hash-tags in the output                                           results.
                      )   
```


# Introduction

In this assignment, we will look at a data set which observed bank loan data. We will conduct exploratory data analysis steps in order to analyze and prepare this data for further analysis. We will address concerns such as potential missing values, along with creating and redefining some of the variables in the original data set to make them more meaningful and relevant for future analysis. 


## Data Description

This data set was collected from the U.S. Small Business Administration (SBA) over the span of the years from 1987 through 2014. This data set contains 899,164 observations of 27 different variables. Each of these observations represent a loan which was guaranteed by the SBA. 



# Exploratory Data Analysis


First, we will read in the combined bank loan data sets.

```{r}
loan01 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
```

```{r}
var.names = names(loan)
my.var = var.names[c(1,6,21,26)]
my.new.data = loan[1:15, my.var]
dim(my.new.data)
```



## Addressing the Missing Values

This data set contains some missing values and so we must address these before continuing any further with our analysis of this data. For this project, we will delete all of the observations which had a missing value of MIS_Status. 

```{r}
colSums(is.na(loan))
```

It turns out that the MIS_Status variable has zero missing observations, so we do not have to delete any observations based upon a missing value of MIS_Status. 

In fact, the only variable with any missing observations is the NewExist variable. We will go ahead and delete these missing observations instead since the MIS_Status variable was all good. There are 136 missing observations of this NewExist variable and we will delete these missing observations. 

We will create a new data set called "loan.noMissing" to represent that it has no missing observations.

```{r}
loan.noMissing <- na.omit(loan)
```

Let's double check that we successfully removed all observations with missing values from the data set.

```{r}
colSums(is.na(loan.noMissing))
```

As we can see, no variables have any more missing observations, so we have successfully dealt with the missing value concern for our data set. 



## Variable Formatting

We will now change all of the currency related variables in this combined bank loan data set to be regular, numeric variables. These currency related variables include DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv. To do this, we will remove the dollar sign and the commas from the observations of these currency related variables. 

```{r}
currency <- c("DisbursementGross", "BalanceGross", "ChgOffPrinGr", "GrAppv", "SBA_Appv")

loan.updated <- loan.noMissing %>%
  mutate(across(all_of(currency), ~ as.numeric(gsub("[$,]", "", .))))
```

Now, all of the currency related variables have been updated to be represented as standard, numeric variables without commas or a dollar sign. The new, updated variable data has been stored in a new data set called "loan.updated". 

We can view a summary of these updated, numeric variables on currency to ensure that these changes to remove the commas and dollar signs properly worked. 

```{r}
summary(loan.updated[currency])
```

As we can see, the values of all of these currency variables are now given as standard, numeric variables. 




## Combining Sparse Categories

Now, we will take a look at one of the categorical variables from this combined bank loan data set and combine its sparse categories in a way that is meaningful. This will allow us to have new categories that are fewer in number, and provide better meaningfulness for the interpretation of this variable.

We will look at the categorical variable "NAICS". This variable looks at the North American Industry Classification System code of each observation. This variable defines each observation by the specific industry of which it is relevant to. 

```{r}
knitr::include_graphics("NAICS.png")
```

We will combined the observations of the NAICS variable into categories based upon their official classification.

```{r}
loan.updated<- loan.updated %>%
  mutate(NAICS.combined = case_when(
    NAICS %in% c("11") ~ "Agriculture",
    NAICS %in% c("21") ~ "Mining",
    NAICS %in% c("22") ~ "Utilities",
    NAICS %in% c("23") ~ "Construction",
    NAICS %in% c("31", "32", "33") ~ "Manufacturing",
    NAICS %in% c("42") ~ "Wholesale Trade",
    NAICS %in% c("48", "49") ~ "Transportation",
    NAICS %in% c("51") ~ "Information",
    NAICS %in% c("52") ~ "Finance",
    NAICS %in% c("53") ~ "Real Estate",
    NAICS %in% c("54") ~ "Professional Services",
    NAICS %in% c("55") ~ "Management",
    NAICS %in% c("56") ~ "Administrative Support",
    NAICS %in% c("61") ~ "Educational Services",
    NAICS %in% c("62") ~ "Healthcare",
    NAICS %in% c("71") ~ "Arts",
    NAICS %in% c("72") ~ "Food Services",
    NAICS %in% c("81") ~ "Other Services",
    NAICS %in% c("92") ~ "Public Administration"
  )
)
```






## Calculating the Default Rates

Next, we will calculate the default rates of the NAICS.combined categorical variable which we created in the previous part. 

For this step, the value CHGOFF in MIS_Status is defined to be loan default. The default rate will be the percentage of CHGOFF in MIS_Status.

```{r}
default.rates <- loan.updated %>%
  group_by(NAICS.combined) %>%
  summarise(total = n(),
            defaults = sum(MIS_Status == "CHGOFF", na.rm = TRUE),
            default.rates = defaults / total)
```

We can view the values of the default rates which we calculated. 

```{r}
default.rates
```

Now, we have successfully calculated the default rates for our NAICS.combined variable which we created to group together the sparse categories. 




## Discretizing the GrAppv Variable

We will discretize  the variable of GrAppv into five categories. This variable represents the gross amount of the loan that has been approved by the bank.

We will create these categories based upon the overall level of he value of the GrAppv value. These subcategories will split the GrAppv variable into five levels, "lowest", "low", "medium", "high", and "highest" based upon the value of the gross amount of loan that has been approved by the ban.

First, let's look at a quick summary of the original GrAppv variable.

```{r}
summary(loan.updated$GrAppv)
```

As we can see, the minimum value of the variable is $200, and the maximum value is $5,472,000. The median value is $90,000 indicating that the maximum value differs much more greatly from the median value than the minimum value does. 

We will split the GrAppv variable into five subcategories based upon the value of each observation. We will create a new variable called "GrAppv.dis" to represent the discretized version of the original GrAppv variable. 

```{r}
loan.updated$GrAppv.dis <- cut(loan.updated$GrAppv,
                                   breaks = 5,
                                   labels = c("Lowest", "Low", "Medium", "High", "Highest"))
```

Now we have five subcategories of the GrAppv variable split up from the lowest to the highest. 




## Density Curves

Lastly, we will draw the density curves of SBA_Appv for each of the five sub-populations that were created in the previous part. We will place these five density curves onto the same plot. 

```{r}
ggplot(loan.updated, aes(x = SBA_Appv, fill = GrAppv.dis)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Curves of SBA_Appv for the \n Five SubCategories of GrAppv",
       x = "SBA_Appv", y = "Density") 
```

In the graph, we can see the density curves of the SBA_Appv variable for each of the five categories of GrAppv that we created in the previous part. We can see that these five categories have been color-coded by their respective rank of "lowest", "low", "medium", "high", "or highest".


# Conclusion

We have completed the desired exploratory data analysis steps including dealing with missing values as well as updating some of the variables within the original combined bank loan data set. We now have a new data set called "loan.updated" with all of the desired updates to the data set. 
