1 Combined Data

loan01 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
# dim(bankLoan)
#names(bankLoan)

2 Mis_Status Missing

loan <- loan[!is.na(loan$MIS_Status), ]

Mis_Status is the status of the persons loan. This is one of the most important variables in our dataset. It ensures us that when we do our analysis we have a complete observation telling us if the loan was approved, paid, or defaulted. So we remove any observations with loan status missing for clarity.

3 Convert Variables

currency_vars <- c("DisbursementGross", "BalanceGross", "ChgOffPrinGr", "GrAppv", "SBA_Appv")

loan[currency_vars] <- lapply(loan[currency_vars], function(x) as.numeric(gsub("[$,]", "", x)))

Since DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv are important measures for our analysis we want to be able to use them more methodically. So we convert these categorical variables to numeric variables and remove any character values. Working with numbers as opposed to charterers is better for statistical analysis, visualization, and overall modeling. ## Categorical Variable rework

loan <- loan %>%
  mutate(BankRegion = case_when(
    BankState %in% c("CT", "ME", "MA", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
    BankState %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI") ~ "Midwest",
    BankState %in% c("AL", "AR", "DE", "DC", "FL", "GA", "KY", "LA", "MD", "MS", "NC", "OK", "SC", "TN", "TX", "VA", "WV") ~ "South",
    BankState %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY") ~ "West",
    TRUE ~ "Unknown"
  ))

table(loan$BankRegion)

  Midwest Northeast     South   Unknown      West 
   269885    150560    275751      1730    201238 

I decided to make 4 regional categories of the states. These categories being Northeast, Midwest, South, West and also an unknown. By doing this we can see if there are any patterns for any specific region. We can use this to compare approval rates, default rates, or average loan amounts by region. This also makes it easier for data visualization to again see if there is any disparities by region.

4 Default Rates based on Region

default_rates <- loan %>%
  group_by(BankRegion) %>%
  summarise(
    Total_Loans = n(),
    Defaults = sum(MIS_Status == "CHGOFF", na.rm = TRUE),
    Default_Rate = Defaults / Total_Loans * 100
  )

print(default_rates)
# A tibble: 5 × 4
  BankRegion Total_Loans Defaults Default_Rate
  <chr>            <int>    <int>        <dbl>
1 Midwest         269885    42537        15.8 
2 Northeast       150560    21021        14.0 
3 South           275751    58751        21.3 
4 Unknown           1730      106         6.13
5 West            201238    35143        17.5 

This code calculates the default rate of SBA-backed loans for each region. With this code it helps us to compare which regions have higher or lower default rates. It also gives insight into regional economic trends and lending practices. We see that in the South the default rate is the highest at 21% meaning that more loans go unpaid here than in any other region.

5 Discretize GrApprv

loan <- loan %>%
  mutate(GrAppv_Cat = cut(GrAppv, 
                          breaks = quantile(GrAppv, probs = seq(0, 1, by = 0.2), na.rm = TRUE),
                          include.lowest = TRUE,
                          labels = c("Very Low", "Low", "Medium", "High", "Very High")))

table(loan$GrAppv_Cat)

 Very Low       Low    Medium      High Very High 
   180997    179102    179575    179677    179813 

In the code above we categorize the SBA guaranteed approval amount into five groups. By categorizing and making this variable into small intervals we convert it from a continuous variable we now have a discretized variable. Discretized variables are easier to interpret. We see from the output above that they are evenly distributed meaning our approval spans a wide range of amounts.

6 SBA_Appr Curves

ggplot(loan, aes(x = SBA_Appv, fill = GrAppv_Cat, color = GrAppv_Cat)) +
  geom_density(alpha = 0.3) + 
  labs(title = "Density Curves of SBA Approval Amount by Loan Size",
       x = "SBA Approval Amount (SBA_Appv)", 
       y = "Density") +
  theme_minimal() +
  theme(legend.title = element_blank()) +  
  coord_cartesian(xlim = c(0, 500000))

The code above shows a density plot of the distribution of SBA approval amounts, categorized by loan size. From the visualization above we can see which loan sizes are most common and helps us see if certain loan categories are higher. The density peaks tell us where most SBA approval amounts fall.

---
title: "EDA Bank Loan"
author: 'Tyler Battaglini'
date: "2025-3-2"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
always_allow_html: true
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-weight: bold;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-weight: bold;
    font-family: system-ui;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# Detect, install and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("nleqslv")) {
   install.packages("nleqslv")
   library(nleqslv)
}
#
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}

if (!require("psych")) {   
  install.packages("psych")
   library(psych)
}
if (!require("MASS")) {   
  install.packages("MASS")
   library(MASS)
}
if (!require("ggplot2")) {   
  install.packages("ggplot2")
   library(ggplot2)
}
if (!require("GGally")) {   
  install.packages("GGally")
   library(GGally)
}
if (!require("car")) {   
  install.packages("car")
   library(car)
}
if (!require("dplyr")) {   
  install.packages("dplyr")
   library(dplyr)
}
if (!require("caret")) {   
  install.packages("caret")
   library(caret)
}
if (!require("readxl")) {   
  install.packages("readxl")
   library(readxl)
}
if (!require("openxlsx")) {   
  install.packages("openxlsx")
   library(openxlsx)
}
if (!require("forecast")) {   
  install.packages("forecast")
   library(forecast)
}
# specifications of outputs of code in code chunks
knitr::opts_chunk$set(echo = TRUE,      # include code chunk in the output file
                      warnings = FALSE,  # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      messages = FALSE,  #
                      results = TRUE,
                      
                      comment = NA       # you can also decide whether to include the output
                                         # in the output file.
                      )   
```


## Combined Data

```{r}
loan01 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 = read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
# dim(bankLoan)
#names(bankLoan)



```

## Mis_Status Missing

```{r}
loan <- loan[!is.na(loan$MIS_Status), ]

```

Mis_Status is the status of the persons loan. This is one of the most important variables in our dataset. It ensures us that when we do our analysis we have a complete observation telling us if the loan was approved, paid, or defaulted. So we remove any observations with loan status missing for clarity.

## Convert Variables

```{r}
currency_vars <- c("DisbursementGross", "BalanceGross", "ChgOffPrinGr", "GrAppv", "SBA_Appv")

loan[currency_vars] <- lapply(loan[currency_vars], function(x) as.numeric(gsub("[$,]", "", x)))

```

Since DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv are important measures for our analysis we want to be able to use them more methodically. So we convert these categorical variables to numeric variables and remove any character values. Working with numbers as opposed to charterers is better for statistical analysis, visualization, and overall modeling.
## Categorical Variable rework

```{r}
loan <- loan %>%
  mutate(BankRegion = case_when(
    BankState %in% c("CT", "ME", "MA", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
    BankState %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI") ~ "Midwest",
    BankState %in% c("AL", "AR", "DE", "DC", "FL", "GA", "KY", "LA", "MD", "MS", "NC", "OK", "SC", "TN", "TX", "VA", "WV") ~ "South",
    BankState %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY") ~ "West",
    TRUE ~ "Unknown"
  ))

table(loan$BankRegion)
```

I decided to make 4 regional categories of the states. These categories being Northeast, Midwest, South, West and also an unknown. By doing this we can see if there are any patterns for any specific region. We can use this to compare approval rates, default rates, or average loan amounts by region. This also makes it easier for data visualization to again see if there is any disparities by region. 

## Default Rates based on Region

```{r}
default_rates <- loan %>%
  group_by(BankRegion) %>%
  summarise(
    Total_Loans = n(),
    Defaults = sum(MIS_Status == "CHGOFF", na.rm = TRUE),
    Default_Rate = Defaults / Total_Loans * 100
  )

print(default_rates)
```

This code calculates the default rate of SBA-backed loans for each region. With this code it helps us to compare which regions have higher or lower default rates. It also gives insight into regional economic trends and lending practices. We see that in the South the default rate is the highest at 21% meaning that more loans go unpaid here than in any other region. 

## Discretize GrApprv

```{r}
loan <- loan %>%
  mutate(GrAppv_Cat = cut(GrAppv, 
                          breaks = quantile(GrAppv, probs = seq(0, 1, by = 0.2), na.rm = TRUE),
                          include.lowest = TRUE,
                          labels = c("Very Low", "Low", "Medium", "High", "Very High")))

table(loan$GrAppv_Cat)
```

In the code above we categorize the SBA guaranteed approval amount into five groups. By categorizing and making this variable into small intervals we convert it from a continuous variable we now have a discretized variable. Discretized variables are easier to interpret. We see from the output above that they are evenly distributed meaning our approval spans a wide range of amounts. 

## SBA_Appr Curves

```{r}
ggplot(loan, aes(x = SBA_Appv, fill = GrAppv_Cat, color = GrAppv_Cat)) +
  geom_density(alpha = 0.3) + 
  labs(title = "Density Curves of SBA Approval Amount by Loan Size",
       x = "SBA Approval Amount (SBA_Appv)", 
       y = "Density") +
  theme_minimal() +
  theme(legend.title = element_blank()) +  
  coord_cartesian(xlim = c(0, 500000))
```

The code above shows a density plot of the distribution of SBA approval amounts, categorized by loan size. From the visualization above we can see which loan sizes are most common and helps us see if certain loan categories are higher. The density peaks tell us where most SBA approval amounts fall. 






