Data Wrangling in R Project

Exploratory Data Analysis of Bank failures since the Great Depression

Synopsis

When does a bank fail?

A bank typically fails for one of the two reasons: Insolvency or Illiquidity. Insolvency occurs when the value of the assets held by a bank is less than the value of the liabilities(Deposits) held. Illiquidity occurs when a bank is not able to meet its current obligations as they come due.

What is the role of FDIC(Federal Deposit Insurance Corporation) in the event of bank failure?

In the event of a bank failure, the FDIC acts in two capacities. First, as the insurer of the bank’s deposits, the FDIC pays insurance to the depositors up to the insurance limit. Second, the FDIC, as the “Receiver” of the failed bank, assumes the task of selling/collecting the assets of the failed bank and settling its debts, including claims for deposits in excess of the insured limit.

The dataset holds information of all the banks in US which have failed from 1934 to present.

Problem Statement

What are the risk indicators which a depositor should take into consideration while depositing money into a bank?. The exploratory data analysis focuses on the typical risk indicators leading to a bank failure.

Solution Overview

Observing the typical characteristics of a failed bank using data visualization and summary statistics which will aid in guiding a depositor to choose a bank which most likely would not run into a failure.

Packages required

The following packages were required:

dplyr: Used for data manipulation
tidyr: Used for tidying the data
stringr: Used for string manipulation
ggplot2: Used for data visualization
printr: Used to print R objects to appropriate formats
DT: Used to display R objects with DataTables
knitr: Used to display an aligned table on the screen
plotly: Used for creating interactive graphs and charts
minuUI: Shiny UI Widget for small screen
shiny: Used for miniUI
scales: Used for scaling in graphs

# Loading the libraries
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(printr)
library(DT)
library(knitr)
library(plotly)
library(miniUI)
library(shiny)
library(scales)

Data Preparation

Data Import

Source: Commercial bank failures, 1934 - Present from Kaggle

This dataset contains 12 variables and 3484 observations of the failure of Commercial Banks, savings Associations and Savings Banks since the establishment of the FDIC in 1933. The bank failure report is available on the FDIC website. Comprehensive data on estimated losses are not available for FDIC-insured failures prior to 1986. Estimated loss is presented as “N/A” in years for which comprehensive information is not available.

# URL to the github repository
url <- "https://raw.githubusercontent.com/sayalidw93/Data-Wrangling/master/project%20data.csv"
# Read csv file
bank_raw <- read.csv(url, stringsAsFactors = FALSE)

Display Class, Variables and Dimensions of the dataset

# Display class of the dataset
class(bank_raw)

## [1] "data.frame"

# Display variable names in the dataset
colnames(bank_raw)

##  [1] "Financial.Institution.Number" "Institution.Name"            
##  [3] "Institution.Type"             "Charter.Type"                
##  [5] "Headquarters"                 "Failure.Date"                
##  [7] "Insurance.Fund"               "Certificate.Number"          
##  [9] "Transaction.Type"             "Total.Deposits"              
## [11] "Total.Assets"                 "Estimated.Loss..2015."

# Display the dimensions of the dataset
dim(bank_raw)

## [1] 3484   12

Codebook

Codebook is available on the FDIC website. Description of the variables in the dataset is as follows:

Financial Institution Number - Financial Institution Number is a unique number assigned to the institution as an Assistance Agreement, Conservatorship, Bridge Bank or Receivership.
Institution Name - This is the legal name of the institution.
Institution Type - Commercial Bank, Savings Association and Savings Bank.
Charter Type - Legal authorization to conduct business granted by federal or state government.
Headquarters - The city and state of the headquarters of location.
Failure Date - The date that the failed institution ceased to exist as a privately held going concern.
Insurance Fund - Insurance funds used for insuring the depositors.
Certificate Number - The certificate number assigned by the FDIC used to identify institutions and for the issuance of insurance certificates.
Transaction Type - Transaction types used to resolve the institution failures.
Total Deposits - Total deposits to the institution which are filed by the institution prior to the failure date (includes demand deposits, money market deposits, other savings deposits, time deposits and deposits in foreign offices).
Total Assets - The total assets owned and filed by the institution prior to the failure date (includes cash, loans, securities, bank premises and other assets).
Estimated Loss(2015) - The estimated loss is the difference between the amount disbursed from the Deposit Insurance Fund to cover obligations to insured depositors and the amount estimated to be ultimately recovered from the liquidation of the receivership estate.

Data Cleaning

Missing values in the data:

I checked the data for missing values.

# Number of missing values for each variable
colSums(is.na(bank_raw))

## Financial.Institution.Number             Institution.Name 
##                          601                            0 
##             Institution.Type                 Charter.Type 
##                            0                            0 
##                 Headquarters                 Failure.Date 
##                            0                            0 
##               Insurance.Fund           Certificate.Number 
##                            0                          485 
##             Transaction.Type               Total.Deposits 
##                            0                            0 
##                 Total.Assets        Estimated.Loss..2015. 
##                          151                          975

Thus, there are missing values for Financial Institution Number, Certificate Number, Total Assets and Estimated Losses. As the variables Financial Institution Number and Certificate Number are not much of a concern in my exploratory analysis for this dataset, I will drop these two variables. However, for Total Assets and Estimated Losses I have decided not to remove any rows with missing data yet as these variables are significant for analysis. I will take care of this issue while doing individual analysis.

# Dropping insignificant variables
bank_raw <- bank_raw %>% select(-Financial.Institution.Number,-Certificate.Number)

Splitting Failure Date and Headquarter variables

For the ease of analysis and to find relation between the variables, I split the Headquarters variable into Headquarter city and State variables, and the Failure Date variable into Month, Date and Year variables.

# Splitting the Headquarter and Failure Date variables
banks <- bank_raw %>%
        separate(Headquarters, c("Headquarter city","State"), sep = ",") %>%
        separate(Failure.Date, c("Month","Date","Year"), sep = "/")

Imputing data

In the process of splitting the variables, a warning popped up for 2914th observation. Let’s see the 2914th observation before and after separating the variables.

# Checking the record after the variables were split
banks[2914,]

	Institution.Name	Institution.Type	Charter.Type	Headquarter city	State	Month	Date	Year	Insurance.Fund	Transaction.Type	Total.Deposits	Total.Assets	Estimated.Loss..2015.
2914	BANK USA, N.A.	COMMERCIAL BANK	FEDERAL	KIHEI	MAUI	5	19	1995	BIF	ACQUISITION	9124	9048	2593

# Checking the record before the variables were split
bank_raw[2914,]

	Institution.Name	Institution.Type	Charter.Type	Headquarters	Failure.Date	Insurance.Fund	Transaction.Type	Total.Deposits	Total.Assets	Estimated.Loss..2015.
2914	BANK USA, N.A.	COMMERCIAL BANK	FEDERAL	KIHEI, MAUI, HI	5/19/1995	BIF	ACQUISITION	9124	9048	2593

Thus, we see that, the state variable is MAUI instead of HI(Hawaii). Comparing it with the original dataset, I decided to replace the Value “MAUI” with “HI” for this record, as the state variable is an important variable in our analysis.

# Imputing value in the State variable
banks$State <- str_replace(banks$State, "MAUI","HI")

Data Preview

In the data, each row is the failed bank and the columns represent the information for each failed bank. Dollar amount is in thousands for Total Assets, Total Deposits and Estimated Losses.

Data

# Get a sense of cleaned data
inputPanel(
  div(style = 'overflow-x:scroll;width:900px',dt <- datatable(tail(banks,100)))
)

Structure and Summary of Clean Data

Data Dictionary

# Create a datatable to display structure of clean data
Var.type <- lapply(banks,class)
colnames(banks) <- c("Institution Name", "Institution Type", "Charter Type",       "Headquarter City", "State", "Month","Date", "Year", "Insurance Fund", "Transaction Type", "Total Deposits", "Total Assets", "Estimated Loss till 2015")
Var.name <- colnames(banks)
strdata <- as_data_frame(cbind(Var.name,Var.type))
colnames(strdata) <- c("Variable Name","Variable Type")
kable(strdata)

Variable Name	Variable Type
Institution Name	character
Institution Type	character
Charter Type	character
Headquarter City	character
State	character
Month	character
Date	character
Year	character
Insurance Fund	character
Transaction Type	character
Total Deposits	integer
Total Assets	integer
Estimated Loss till 2015	integer

Summary Statistics

# Summary Statistics
Total.Deposits <- summary(banks$`Total Deposits`)
Total.Assets <- summary(banks$`Total Assets`)
Estimated.Loss <- summary(banks$`Estimated Loss till 2015`)
# Merging the summary statistics for 3 variables in a data frame
summary_stats <- rbind(Total.Deposits,Total.Assets,Estimated.Loss)
# Display the summary statistics table
kable(summary_stats)

	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.	NA’s
Total.Deposits	0	12293	41721	299776.18	156241.5	188260793	0
Total.Assets	14	15051	47523	404588.41	173478.0	307021614	151
Estimated.Loss	-4386	3852	13354	70644.35	44585.0	12438005	975

The summary statistics table gives us an idea about Mean, Median, Minimum, Maximum, Missing values, 1st Quartile and 3rd Quartile for the numeric variables of interest. The negative Minimum value of Estimated loss may indicate that those banks had a slight margin for profit, however going forward estimated to run out of money in the near future, hence opted for early bankruptcy.Thus, those are not unusual values and are not deleted.

Exploratory Data Analysis

This section aims to explore the various risk indicators of a bank failure.

Bank Failure Rate from 1934 - 2017

Analyze the the bank failure rate since the Great Depression

# Bank failure rate over the years
plot_ly(data=banks, x = ~Year, type = "histogram") %>%
  layout(title = 'A History of Bank Failures in the United States',
         yaxis = list(title='Number of institutions failed', zeroline = FALSE),
         xaxis =  list(title = 'Year',
                       titlefont=list(size=12),
                       tickfont=list(size=8), zeroline = FALSE),
         width = 800)

From the graph of History of Bank Failures in the United States, we observe three peaks. Firstly, from 1934 to 1941, second from early 1980s to early 1990s and third, from 2007 to 2012, which is very much intuitive. We know, after the stock market crash of October 1929, the Great Depression began from 1929 and lasted until 1939, resulting in the bank failure of almost half of those existed in the United States back then. From early 1980s to early 1990s, the United States faced something called the Savings and Loan Crisis which also resulted in a large number of bank failures. Then came the Great Recession lasting from 2007 to 2012, which culminated in the collapse of the financial sector (banks) again in the United States.

Across States

Analyze the bank failure across the various states

# StateMap 
# Add State Name column
banks$State <- str_replace(banks$State, " ","")
banks$`State Name` <- state.name[match(banks$State, state.abb)]
states_map <- map_data("state")

state_count1 <- banks %>%
  subset(`State Name` != "") %>%
  subset(`State Name` != "New Line") %>%
  group_by(`State Name`) %>%
  summarise(`Failure Count`=n())

state_count1$`State Name` <- tolower(state_count1$`State Name`)

g <- ggplot(state_count1, aes(map_id = `State Name`)) +
      geom_map(aes(fill = `Failure Count`), map = states_map) +
      scale_fill_gradient(low = 'light blue', high = 'dark blue')+
      ggtitle("Bank Failures across the various states in the US")+
      expand_limits(x = states_map$long, y = states_map$lat) +
      theme_void()

ggplotly(g, height = 600, width = 900)

From the United States Map, we observe that Texas faced 736 bank failures, the highest count in the United States. Followed by California, Illinois, Florida, Oklahoma, Louisiana, Georgia, Missouri, Kansas and Colorado. It would be interesting to know, which financial crisis period was the reason for failure of so many banks in these states. The heat map below gives us an exact picture.

# Check variation over State and Year
state_count <- banks %>%
  subset(`State Name` != "") %>%
  subset(`State Name` != "New Line") %>%
  group_by(`State Name`,Year) %>%
  summarise(`Failure Count`=n())

# HeatMap 
b <- ggplot(state_count,aes(Year,`State Name`))+
      geom_tile(aes(fill=`Failure Count`),colour="white")+
      theme(axis.text.x = element_text(angle = 90, hjust = 1,size =8, margin(3,3,3,3)),
            axis.text.y = element_text(hjust = 1,size =8, margin(2,2,2,2)),
            plot.margin = margin(10, 10, 10, 10))+
      scale_fill_gradient(low="light blue",high = "dark blue")+
      xlab("Year of Bank Failure")+
      ylab("State")+
      ggtitle("State-wise Bank Failure Rate over the Years")+
      guides(fill=FALSE)
ggplotly(b, height = 700, width = 900)

From the heat map we notice some interesting results. Texas again is highlighted here. It is compulsive to note that, Texas was the only state, which showed consistent bank failures between the Great Depression and the Savings and Loan crisis. It had the maximum number of failures during the Savings and Loan crisis. Followed by Louisiana, Oklahoma, Illinois, California and Florida. Thus, the savings and loan crisis which recorded a very large number of bank failure in the United States, mostly impacted the top 10 states with a high bank failure rate as observed in the previous state map. The Great Recession hit Washington, Florida, Georgia and Illinois the most. Missouri, New Jersey, New York, North Dakota, South Dakota and Wisconsin were most impacted by the Great Depression.

Across Institution Types

Analyze the Bank Failure Rate across different types of Banking Institutions

# Failure rate over Institution Type

banks %>% group_by(`Institution Type`, Year) %>% summarise(Count = n()) %>%
  ggplot(aes(x = Year, y = Count, group = `Institution Type`, col = `Institution Type`)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 7, margin(3,3,3,3)),
        plot.margin = margin(10, 30, 10, 10),
        legend.position = c(0.9,0.9)) +
  ggtitle("Bank Failures across the Institution Types in the US")+
  ylab("Number of Institutions failed")+
  geom_line() -> m
ggplotly(m, tooltip = c("x","y", "group"), width = 1000)

From this graph, we learn that, the Commercial banks were hit by all the three economic downfalls. Not only this, but also in time period between the Great Depression and the Savings and Loan crisis, Commercial Banks show a consistent failure rate. The Great Depression seems to have hit the Commercial banks majorly. Then comes the Savings and Loan crisis, and as the name suggests, it recorded a large number of Savings Associations and Savings Banks going Bankrupt. Even during the Savings and Loan crisis, Commercial Banks started failing first. The Savings and Loan Crisis is thus arguably the most catastrophic collapse of the banking industry since the Great Depression. Across the United States, more than 1000 banks failed, essentially ending one of the most secure sources of home mortgages. The crisis was felt twice over in Texas, where atleast half of the failed Savings and Loan Banks were based.

# Estimated Losses across different Institution Types
banks %>% ggplot(aes(x = `Institution Type`, y = `Estimated Loss till 2015`,
                     col = `Institution Type`,
                     text = paste('Bank: ' ,`Institution Name`)))+
  ggtitle("Estimated Losses for different Institution Types")+
  theme(axis.text.x = element_text(size = 7, margin(4,4,4,4)),
         axis.text.y = element_text(size = 7, margin(6,15,15,6)),
        plot.margin = margin(10, 20, 10, 15))+
  ylab("Estimated Losses in thousands")+
  scale_y_continuous(labels = dollar)+
  geom_point(alpha = 0.5) + geom_jitter() -> x
ggplotly(x, tooltip = c("x", "y", "text"), width = 1000)

The scatter plot shows the Estimated losses across the different types of banking institutions. The Estimated Losses are more or less similar for all the types of Banking Institutions. However, it is worth noting that Indymac Bank F.S.B and BankUnited F.S.B, the Savings Banks Insitutions had very large Estimated losses in bankruptcy.

Across Transaction Types

Analyze the Bank Failure Rate across different Transaction types

# Failure rate over Transaction Type

banks %>% group_by(`Transaction Type`, Year) %>% summarise(Count = n()) %>%
  ggplot(aes(x = Year, y = Count, group = `Transaction Type`, col = `Transaction Type`)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 7, margin(3,3,3,3)),
        plot.margin = margin(10, 20, 10, 10)) +
  ggtitle("Bank Failures across the Transaction Types in the US")+
  ylab("Number of Institutions failed")+
  geom_line() -> m
ggplotly(m, tooltip = c("x","y", "group"), width = 1000)

This graph gives us an idea about how the Banking System has evolved since the Great Depression. During the Great Depression we observe that the failed banks who had to payout were more than the ones who were acquired. This was one of the striking features of the Great Depression where the FDIC was unable to find acquirers for a lot of failed banks, and thus had to conduct a deposit payoff, which is often disruptive to the local community. But during the Savings and Loan crisis and during the Great Recession, we see a trend inclined more towards the acquisition of the failed banks. This can be owed to a large number prominent established banks existent during those periods. We also see that during the Savings and Loan Crisis, a large number of Transfer (Insured Deposit Transfer - IDT) for failed banks. This is because in 1983, to alleviate the problems caused due to Deposit Payoff, FDIC developed a new resolution alternative: the IDT. Using this method, all of the insured deposits are transferred to a healthy financial institution and are available immediately after the failed bank is closed. However, during the Great Recession, we see a very large percentage of acquisitions for the failed banks and no transfers at all. This only suggests that, over the years, the prominent banks only grew, and opted for acquisition over transfer, resultion in acquisition of over 90% of failed banks during the Great Recession. We also observe that, after the Great Depression, until 1975, Payouts dominated over Acquisitions, but we see the trend changing after 1975, where acquisition of failed banks was predominant, suggesting, gradual establishment of larger banks over time.

Now, we analyze the estimated losses for the different transaction types.

banks %>% ggplot(aes(x = `Transaction Type`, y = `Estimated Loss till 2015`,
                     col = `Transaction Type`,
                     text = paste('Bank: ' ,`Institution Name`)))+
  ggtitle("Estimated Losses for different Transaction Types")+
   theme(axis.text.x = element_text(size = 7, margin(4,4,4,4)),
         axis.text.y = element_text(size = 7, margin(6,15,15,6)),
        plot.margin = margin(10, 20, 10, 15)) +
  ylab("Estimated Losses in thousands")+
  scale_y_continuous(labels = dollar)+
  geom_point(alpha = 0.5) + geom_jitter() -> x
ggplotly(x, tooltip = c("x", "y", "text"), width = 1000)

From the graph, we see an odd observation where the estimated losses are very high for the failed banks which were acquired. Thus, it only suggests, that the failed banks were acquired despite high losses, because there could be other anticipated positive outcomes out of acquisition, like expanded customer base, also, the acquiring banks were so well established that they could absorb all the short term losses born during the acquisition with an eye for long term gains.

Across Charter Types

Analyze the Bank Failure Rate across different Charter types

# Failure rate over Charter Type

banks %>% group_by(`Charter Type`, Year) %>% summarise(Count = n()) %>%
  ggplot(aes(x = Year, y = Count, group = `Charter Type`, col = `Charter Type`)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 7, margin(3,3,3,3)),
        plot.margin = margin(10, 20, 10, 10)) +
  ggtitle("Bank Failures across the Charter Types in the US")+
  ylab("Number of Institutions failed")+
  geom_line() -> m
ggplotly(m, tooltip = c("x","y", "group") , width = 1000)

From this graph, we observe that, State Charter Banks were impacted during all the three economic downfalls followed by the Federal Charter Banks. However, during the Savings and Loans Crisis, we see the Federal/State Charter banks were hit the most. They performed comparatively well during the Great Recession.

Relation between Deposits and Assets

Analyze the relationship between Total Deposits and Total Assets of Failed Banks

# Relation between Deposits and Assets
banks %>% ggplot(aes(x = `Total Assets`, y = `Total Deposits`,
                     col = `Institution Type`, text = paste('Bank: ' ,`Institution Name`)))+
  theme(axis.text.x = element_text(size = 7, margin(4,4,4,4)),
         axis.text.y = element_text(size = 7, margin(6,20,20,6)),
        plot.margin = margin(10, 20, 10, 20))+
  scale_y_continuous(labels = dollar)+
  scale_x_continuous(labels = dollar)+
  ylab("Total Deposits in thousands")+
  xlab("Total Assets in thousands")+
  ggtitle("Relation between Total Deposits and Total Assets")+
  geom_point(alpha = 0.5) -> x
ggplotly(x, width = 900)

Correlation between Total Deposits and Total Assets

temp <- banks %>% na.omit()
cor(temp$`Total Assets`, temp$`Total Deposits`)

## [1] 0.9982641

We see a typical feature of Bankrupcty being highlighted from the above scatter plot. Total Deposits and Total Assets have a near linear relationship with a correlation of 0.99. Thus, banks are declared bankrupt when the difference between their Total Assets and Total Deposits diminishes. We see one bank which failed which had a very high value of Total Assets and Total Deposits. Its the Washington Mutual Bank which came in the news as the largest failed bank in the US history.

Summary

Problem Statement This analysis is intended to help consumers by highlighting the typical risk indicators of bank failures since the Great Depression.

Methodology This analysis focuses mainly on data visualizations to gain insights from data.

Insights

We see that the Great Depression, the Savings and Loan crisis and the Great Recession observed majority of the bank failures since 1934, suggesting market conditions led to their failure. This also suggests, that the banks failing in years apart from these time periods, failed because of their own insufficiency.
Texas was the state, which had the most bank failures since the Great Depression, with most of the failures observed during the Savings and Loan crisis, where half of the banks failed during this period were based here. Thus, Texas was majorly hit by the Savings and Loan Crisis. Also, Texas, has shown a near consistent failure rate in between the Great Depression and the Savings and Loan crisis.The other states, which showed a high failure trend in general were California, Illinois, Florida, Oklahoma, Louisiana, Georgia, Missouri, Kansas and Colorado.
Typically, Commercial Banks were hit during all the three economic crises, indicating their higher tendency for failure over Savings Association and Savings Banks. However, the Savings and Loan crises, which was the most catastrophic collapse after the Great Depression, resulted in the failure of almost 1600 Savings Associations and Saving banks.
Failure rate across Transaction Types suggest how the banking system has evolved since the Great Depression. The Great Depression observed lot of Payouts for failed banks, thus, causing a lot of disruption to the local community. During The trend changed from 1975, where acquisition of failed banks became more predominant, owing to the gradual establishment of large banks ready to acquire the failed banks. Also, from 1983, FDIC came up with Internal Deposit Transfers, which helped in alleviating the problem caused by Payouts during the Savings and Loan crisis. It is worth noting that, by the time of Great Recession in 2008, almost 90% of the failed banks were acquired, which could be attributed to the strong base of established larger banks during that time.
State Charter Banks are the ones which were most hit during the economic crises and during intermediate periods as well. Thus suggesting a high failure rate over the other Charter type banks.
Diminishing difference between a bank’s Total Assets and Total Deposits is a typical indicator of a bank running into failure.

Future Scope

Try to find out the reasons for high failure rate among particular states. This could be achieved by taking various factors like inflation, state tax, major revenue for every state etc. into consideration.
Collect historic data of all the banks existent since the Great Depression, and do logistic regression to predict bank failure.