For this report, I will be analyzing bank loan default data. The goal of this analysis is to clean up this data set and do some exploratory data analysis.

First, we load all of the data. This data set is split into 9 different files.

loan01 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)

1 Missing Values

The first thing I will do is remove observations in which the value of MIS_Status is missing. A ‘new’ data set with those observations removed will be created. Further EDA will be done with the reduced set.

miss = loan[which(loan$MIS_Status == ""),]
nrow(miss)

## [1] 1997

newloan <- loan[-which(loan$MIS_Status == ""),]

2 Currency Variables

Next, all of the currency variables (DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv) will be changed. All of their dollar signs and commas will be removed, and all values will be converted to numeric.

newloan$DisbursementGross<- as.numeric(gsub("[\\$,]", "", newloan$DisbursementGross))
newloan$BalanceGross <- as.numeric(gsub("[\\$,]", "", newloan$BalanceGross))
newloan$ChgOffPrinGr <- as.numeric(gsub("[\\$,]", "", newloan$ChgOffPrinGr))
newloan$GrAppv <- as.numeric(gsub("[\\$,]", "", newloan$GrAppv))
newloan$SBA_Appv <- as.numeric(gsub("[\\$,]", "", newloan$SBA_Appv))

3 Combining Categories

Next, I will organize every observation by region. Specifically, i will group states by the geographical region. There will be 5 regions total: Northeast, southeast, Southwest, Midwest, and West.

First, let’s generate a frequency table of loans by state.

EachState = substr(newloan$State, 1, 2)   
newloan$State = EachState         
ftable = table(newloan$State)
kable(t(ftable))

	AK	AL	AR	AZ	CA	CO	CT	DC	DE	FL	GA	HI	IA	ID	IL	IN	KS	KY	LA	MA	MD	ME	MI	MN	MO	MS	MT	NC	ND	NE	NH	NJ	NM	NV	NY	OH	OK	OR	PA	RI	SC	SD	TN	TX	UT	VA	VT	WA	WI	WV	WY
13	2403	8360	6333	17624	130488	20598	12127	1613	2193	41192	22272	3607	11986	9496	29594	14079	11455	7720	9570	25092	13259	5937	20493	24359	20652	7668	8737	14286	5237	6381	12023	24010	6026	8022	57394	32498	9945	11041	34801	8948	5595	4428	9400	70401	18772	13207	5433	23255	21021	3285	2838

From this output, we can see that there are missing values for state. They will be dropped from this data set.

newloan1 <- newloan[-which(newloan$State == ""),]

Now, the observations will be grouped by region.

newloan1 <- newloan1 %>%
  mutate(State = case_when(
    State %in% c("AL", "AR", "GA", "KY", "LA", "MS", "SC", "TN", "WV", "OK") ~ "SouthWest",
    State %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY") ~ "West",
    State %in% c("CT", "DE", "MA", "MD", "ME", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
    State %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI") ~ "Midwest",
    State %in% c("FL", "NC", "TX", "VA", "DC") ~ "Southeast",
    TRUE ~ as.character(State)
  ))

4 Calculating Default Rates Of each region

Now the loan default rates for every region will be calculated.

rates <- newloan1 %>%
  group_by(State) %>%
  summarise(Default_Rates = mean(MIS_Status == "CHGOFF"))

rates

## # A tibble: 5 × 2
##   State     Default_Rates
##   <chr>             <dbl>
## 1 Midwest           0.159
## 2 Northeast         0.162
## 3 SouthWest         0.194
## 4 Southeast         0.214
## 5 West              0.172

5 Splitting GrAppv

Next, we will split the entire data set into 5 subpopulations via GrAppv.

splitgroup <- c(1:5)
newloan2 <- split(newloan1$GrAppv, splitgroup)

summary(newloan2)

##   Length Class  Mode   
## 1 179431 -none- numeric
## 2 179431 -none- numeric
## 3 179431 -none- numeric
## 4 179431 -none- numeric
## 5 179430 -none- numeric

$ Density Curves for SBA_Appv

Week 6 Homework: Descriptive Analysis of Bank Loan Data

Ian VanWright

03/03/2024

1 Missing Values

2 Currency Variables

3 Combining Categories

4 Calculating Default Rates Of each region

5 Splitting GrAppv