For this report, I will be analyzing bank loan default data. The goal of this analysis is to clean up this data set and do some exploratory data analysis.
First, we load all of the data. This data set is split into 9 different files.
loan01 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
The first thing I will do is remove observations in which the value of MIS_Status is missing. A ‘new’ data set with those observations removed will be created. Further EDA will be done with the reduced set.
miss = loan[which(loan$MIS_Status == ""),]
nrow(miss)
## [1] 1997
newloan <- loan[-which(loan$MIS_Status == ""),]
Next, all of the currency variables (DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv) will be changed. All of their dollar signs and commas will be removed, and all values will be converted to numeric.
newloan$DisbursementGross<- as.numeric(gsub("[\\$,]", "", newloan$DisbursementGross))
newloan$BalanceGross <- as.numeric(gsub("[\\$,]", "", newloan$BalanceGross))
newloan$ChgOffPrinGr <- as.numeric(gsub("[\\$,]", "", newloan$ChgOffPrinGr))
newloan$GrAppv <- as.numeric(gsub("[\\$,]", "", newloan$GrAppv))
newloan$SBA_Appv <- as.numeric(gsub("[\\$,]", "", newloan$SBA_Appv))
Next, I will organize every observation by region. Specifically, i will group states by the geographical region. There will be 5 regions total: Northeast, southeast, Southwest, Midwest, and West.
First, let’s generate a frequency table of loans by state.
EachState = substr(newloan$State, 1, 2)
newloan$State = EachState
ftable = table(newloan$State)
kable(t(ftable))
| AK | AL | AR | AZ | CA | CO | CT | DC | DE | FL | GA | HI | IA | ID | IL | IN | KS | KY | LA | MA | MD | ME | MI | MN | MO | MS | MT | NC | ND | NE | NH | NJ | NM | NV | NY | OH | OK | OR | PA | RI | SC | SD | TN | TX | UT | VA | VT | WA | WI | WV | WY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 2403 | 8360 | 6333 | 17624 | 130488 | 20598 | 12127 | 1613 | 2193 | 41192 | 22272 | 3607 | 11986 | 9496 | 29594 | 14079 | 11455 | 7720 | 9570 | 25092 | 13259 | 5937 | 20493 | 24359 | 20652 | 7668 | 8737 | 14286 | 5237 | 6381 | 12023 | 24010 | 6026 | 8022 | 57394 | 32498 | 9945 | 11041 | 34801 | 8948 | 5595 | 4428 | 9400 | 70401 | 18772 | 13207 | 5433 | 23255 | 21021 | 3285 | 2838 |
From this output, we can see that there are missing values for state. They will be dropped from this data set.
newloan1 <- newloan[-which(newloan$State == ""),]
Now, the observations will be grouped by region.
newloan1 <- newloan1 %>%
mutate(State = case_when(
State %in% c("AL", "AR", "GA", "KY", "LA", "MS", "SC", "TN", "WV", "OK") ~ "SouthWest",
State %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY") ~ "West",
State %in% c("CT", "DE", "MA", "MD", "ME", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
State %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI") ~ "Midwest",
State %in% c("FL", "NC", "TX", "VA", "DC") ~ "Southeast",
TRUE ~ as.character(State)
))
Now the loan default rates for every region will be calculated.
rates <- newloan1 %>%
group_by(State) %>%
summarise(Default_Rates = mean(MIS_Status == "CHGOFF"))
rates
## # A tibble: 5 × 2
## State Default_Rates
## <chr> <dbl>
## 1 Midwest 0.159
## 2 Northeast 0.162
## 3 SouthWest 0.194
## 4 Southeast 0.214
## 5 West 0.172
Next, we will split the entire data set into 5 subpopulations via GrAppv.
splitgroup <- c(1:5)
newloan2 <- split(newloan1$GrAppv, splitgroup)
summary(newloan2)
## Length Class Mode
## 1 179431 -none- numeric
## 2 179431 -none- numeric
## 3 179431 -none- numeric
## 4 179431 -none- numeric
## 5 179430 -none- numeric
$ Density Curves for SBA_Appv