For this report, I will be analyzing bank loan default data. The goal of this analysis is to clean up this data set and do some exploratory data analysis.
First, we load all of the data. This data set is split into 9 different files.
loan01 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan <- rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)
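Since the nine files share a common naming pattern, the same loading step could also be written as a loop. This is only a sketch equivalent to the code above, assuming the same URLs:
base <- "https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational0%d.csv"
# Read each of the nine files, drop the first (index) column, and row-bind them.
loan <- do.call(rbind, lapply(1:9, function(i) read.csv(sprintf(base, i), header = TRUE)[, -1]))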
The first thing I will do is remove observations in which the value of MIS_Status is missing. A new data set with those observations removed will be created, and further EDA will be done with this reduced set.
miss <- loan[which(loan$MIS_Status == ""), ]
nrow(miss)
## [1] 1997
newloan <- loan[-which(loan$MIS_Status == ""),]
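As a quick sanity check (this line is not part of the original output), the difference in row counts should account for exactly the 1,997 blank MIS_Status rows:
nrow(loan) - nrow(newloan)   # should equal nrow(miss), i.e. 1997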
Next, all of the currency variables (DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv) will be cleaned: their dollar signs and commas will be removed, and their values will be converted to numeric.
newloan$DisbursementGross<- as.numeric(gsub("[\\$,]", "", newloan$DisbursementGross))
newloan$BalanceGross <- as.numeric(gsub("[\\$,]", "", newloan$BalanceGross))
newloan$ChgOffPrinGr <- as.numeric(gsub("[\\$,]", "", newloan$ChgOffPrinGr))
newloan$GrAppv <- as.numeric(gsub("[\\$,]", "", newloan$GrAppv))
newloan$SBA_Appv <- as.numeric(gsub("[\\$,]", "", newloan$SBA_Appv))
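The same cleaning can be written once as a loop over the currency columns. A minimal sketch equivalent to the five lines above:
money_cols <- c("DisbursementGross", "BalanceGross", "ChgOffPrinGr", "GrAppv", "SBA_Appv")
# Strip "$" and "," from each currency column, then convert it to numeric.
newloan[money_cols] <- lapply(newloan[money_cols], function(x) as.numeric(gsub("[$,]", "", x)))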
Next, I will organize every observation by region. Specifically, I will group states by geographical region. There will be 5 regions total: Northeast, Southeast, Southwest, Midwest, and West.
First, let’s generate a frequency table of loans by state.
library(knitr)   # provides kable()
EachState <- substr(newloan$State, 1, 2)   # keep only the two-letter state abbreviation
newloan$State <- EachState
ftable <- table(newloan$State)
kable(t(ftable))
| (missing) | AK | AL | AR | AZ | CA | CO | CT | DC | DE | FL | GA | HI | IA | ID | IL | IN | KS | KY | LA | MA | MD | ME | MI | MN | MO | MS | MT | NC | ND | NE | NH | NJ | NM | NV | NY | OH | OK | OR | PA | RI | SC | SD | TN | TX | UT | VA | VT | WA | WI | WV | WY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 2403 | 8360 | 6333 | 17624 | 130488 | 20598 | 12127 | 1613 | 2193 | 41192 | 22272 | 3607 | 11986 | 9496 | 29594 | 14079 | 11455 | 7720 | 9570 | 25092 | 13259 | 5937 | 20493 | 24359 | 20652 | 7668 | 8737 | 14286 | 5237 | 6381 | 12023 | 24010 | 6026 | 8022 | 57394 | 32498 | 9945 | 11041 | 34801 | 8948 | 5595 | 4428 | 9400 | 70401 | 18772 | 13207 | 5433 | 23255 | 21021 | 3285 | 2838 |
From this table, we can see that some observations have a blank value for State. They will be dropped from the data set.
newloan1 <- newloan[-which(newloan$State == ""),]
Now, the observations will be grouped by region.
library(dplyr)   # provides %>%, mutate(), case_when(), group_by(), summarise()
newloan1 <- newloan1 %>%
mutate(State = case_when(
State %in% c("AL", "AR", "GA", "KY", "LA", "MS", "SC", "TN", "WV", "OK") ~ "SouthWest",
State %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY") ~ "West",
State %in% c("CT", "DE", "MA", "MD", "ME", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
State %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI") ~ "Midwest",
State %in% c("FL", "NC", "TX", "VA", "DC") ~ "Southeast",
TRUE ~ as.character(State)
))
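As a quick check (not shown in the original output), re-tabulating the recoded State column should confirm that every state was mapped and only the five region labels remain:
table(newloan1$State)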
Now the loan default rate for each region will be calculated, where a default is an observation whose MIS_Status equals "CHGOFF".
rates <- newloan1 %>%
group_by(State) %>%
summarise(Default_Rates = mean(MIS_Status == "CHGOFF"))
rates
## # A tibble: 5 × 2
## State Default_Rates
## <chr> <dbl>
## 1 Midwest 0.159
## 2 Northeast 0.162
## 3 SouthWest 0.194
## 4 Southeast 0.214
## 5 West 0.172
Next, the GrAppv values will be split into 5 subpopulations of roughly equal size.
sortGroup <- 1:5   # recycled down the rows, so rows 1, 6, 11, ... fall in group 1, and so on
newloan2 <- split(newloan1$GrAppv, sortGroup)
summary(newloan2)
## Length Class Mode
## 1 179431 -none- numeric
## 2 179431 -none- numeric
## 3 179431 -none- numeric
## 4 179431 -none- numeric
## 5 179430 -none- numeric
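Note that split() assigns the groups by row position (the labels 1 through 5 are recycled down the data), not by the size of GrAppv. If the intent were to group by the approved amount itself, one alternative, offered here only as a sketch and not part of the original analysis, is dplyr's ntile(), which assigns each loan to a quintile of GrAppv:
newloan1$GrAppvGroup <- ntile(newloan1$GrAppv, 5)   # 1 = smallest 20% of GrAppv, 5 = largest 20%
table(newloan1$GrAppvGroup)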
Density Curves for SBA_Appv
Finally, we will draw the density curves for SBA_Appv based on the groups we defined in part 5.
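The plotting code itself is not shown here; a minimal sketch of one way to produce the curves, assuming ggplot2 is available and re-using the same recycled group labels 1 to 5 from the split above:
library(ggplot2)
# Attach the same recycled 1-5 labels that split() used, then overlay the five density curves.
newloan1$Group <- factor(rep_len(1:5, nrow(newloan1)))
ggplot(newloan1, aes(x = SBA_Appv, colour = Group)) +
  geom_density() +
  labs(title = "Density of SBA_Appv by group", x = "SBA approved amount ($)")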
In this report, I performed exploratory data analysis on bank loan data. First, I combined all 9 subsets of data into one large data set. Second, I removed all observations with a missing value for MIS_Status. Third, the currency variables were converted from character to numeric after their dollar signs and commas were removed. Fourth, the observations were grouped into geographical regions by state, and the default rate for each region was calculated. Fifth, the observations were split into 5 groups via GrAppv. Finally, density curves were generated for each group based on the variable SBA_Appv.