For this report, I will be analyzing bank loan default data. The goal of this analysis is to clean up this data set and do some exploratory data analysis.

First, we load all of the data. This data set is split into 9 different files.

loan01 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational01.csv", header = TRUE)[, -1]
loan02 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational02.csv", header = TRUE)[, -1]
loan03 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational03.csv", header = TRUE)[, -1]
loan04 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational04.csv", header = TRUE)[, -1]
loan05 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational05.csv", header = TRUE)[, -1]
loan06 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational06.csv", header = TRUE)[, -1]
loan07 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational07.csv", header = TRUE)[, -1]
loan08 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational08.csv", header = TRUE)[, -1]
loan09 <- read.csv("https://pengdsci.github.io/datasets/SBAloan/w06-SBAnational09.csv", header = TRUE)[, -1]
loan = rbind(loan01, loan02, loan03, loan04, loan05, loan06, loan07, loan08, loan09)

1 Missing Values

The first thing I will do is remove observations in which the value of MIS_Status is missing. A ‘new’ data set with those observations removed will be created. Further EDA will be done with the reduced set.

miss = loan[which(loan$MIS_Status == ""),]
nrow(miss)
## [1] 1997
newloan <- loan[-which(loan$MIS_Status == ""),]

2 Currency Variables

Next, all of the currency variables (DisbursementGross, BalanceGross, ChgOffPrinGr, GrAppv, and SBA_Appv) will be changed. All of their dollar signs and commas will be removed, and all values will be converted to numeric.

newloan$DisbursementGross<- as.numeric(gsub("[\\$,]", "", newloan$DisbursementGross))
newloan$BalanceGross <- as.numeric(gsub("[\\$,]", "", newloan$BalanceGross))
newloan$ChgOffPrinGr <- as.numeric(gsub("[\\$,]", "", newloan$ChgOffPrinGr))
newloan$GrAppv <- as.numeric(gsub("[\\$,]", "", newloan$GrAppv))
newloan$SBA_Appv <- as.numeric(gsub("[\\$,]", "", newloan$SBA_Appv))

3 Combining Categories

Next, I will organize every observation by region. Specifically, i will group states by the geographical region. There will be 5 regions total: Northeast, southeast, Southwest, Midwest, and West.

First, let’s generate a frequency table of loans by state.

EachState = substr(newloan$State, 1, 2)   
newloan$State = EachState         
ftable = table(newloan$State)
kable(t(ftable))
AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
13 2403 8360 6333 17624 130488 20598 12127 1613 2193 41192 22272 3607 11986 9496 29594 14079 11455 7720 9570 25092 13259 5937 20493 24359 20652 7668 8737 14286 5237 6381 12023 24010 6026 8022 57394 32498 9945 11041 34801 8948 5595 4428 9400 70401 18772 13207 5433 23255 21021 3285 2838

From this output, we can see that there are missing values for state. They will be dropped from this data set.

newloan1 <- newloan[-which(newloan$State == ""),]

Now, the observations will be grouped by region.

newloan1 <- newloan1 %>%
  mutate(State = case_when(
    State %in% c("AL", "AR", "GA", "KY", "LA", "MS", "SC", "TN", "WV", "OK") ~ "SouthWest",
    State %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY") ~ "West",
    State %in% c("CT", "DE", "MA", "MD", "ME", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
    State %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI") ~ "Midwest",
    State %in% c("FL", "NC", "TX", "VA", "DC") ~ "Southeast",
    TRUE ~ as.character(State)
  ))

4 Calculating Default Rates Of each region

Now the loan default rates for every region will be calculated.

rates <- newloan1 %>%
  group_by(State) %>%
  summarise(Default_Rates = mean(MIS_Status == "CHGOFF"))

rates
## # A tibble: 5 × 2
##   State     Default_Rates
##   <chr>             <dbl>
## 1 Midwest           0.159
## 2 Northeast         0.162
## 3 SouthWest         0.194
## 4 Southeast         0.214
## 5 West              0.172

5 Splitting GrAppv

Next, we will split the entire data set into 5 subpopulations via GrAppv.

sortGroup <- c(1:5)

newloan2 <- split(newloan1$GrAppv, sortGroup)  
summary(newloan2)
##   Length Class  Mode   
## 1 179431 -none- numeric
## 2 179431 -none- numeric
## 3 179431 -none- numeric
## 4 179431 -none- numeric
## 5 179430 -none- numeric

$ Density Curves for SBA_Appv

Finally, we will draw the density curves for SBA_Appv based on the groups we defined in part 5.

6 Summary and Conclusion

I did some exploratory data analysis on data concerning bank loans. First, i combined all 9 subsets of data into one massive data set. Then, i removed all missing values for the variable MIS_Status. Then, the currency variables were changed from character variables to numeric and had their dollar signs and commas removed. Third, the observations were sorted into geographical regions by state, and default rates for each region were calculated. Next, observations were then sorted into 5 groups via Gr_Appv. And finally, Density curves were generated for each group based on the variable SBA_Appv.