Project Proposal: Exploratory Data Analysis of Bank failures since the Great Depression

Column

Synopsis

When does a bank fail?

A bank typically fails for one of the two reasons: Insolvency or Illiquidity. Insolvency occurs when the value of the assets held by a bank is less than the value of the liabilities(Deposits) held. Illiquidity occurs when a bank is not able to meet its current obligations as they come due.

What is the role of FDIC(Federal Deposit Insurance Corporation) in the event of bank failure?

In the event of a bank failure, the FDIC acts in two capacities. First, as the insurer of the bank’s deposits, the FDIC pays insurance to the depositors up to the insurance limit. Second, the FDIC, as the “Receiver” of the failed bank, assumes the task of selling/collecting the assets of the failed bank and settling its debts, including claims for deposits in excess of the insured limit.

The dataset holds information of all the banks in US which have failed from 1934 to present.

Problem Statement

What are the risk indicators which a depositor should take into consideration while depositing money into a bank?. The exploratory data analysis focuses on the typical risk indicators leading to a bank failure.

Solution Overview

Observing the typical characteristics of a bank failure using data visualization and summary statistics which will aid in guiding a depositor to choose a bank which most likely would not run into a failure.

Packages required

The following packages were required:

flexdashboard: Used to create interactive flex dashboards in R Markdown
dplyr: Used for data manipulation
tidyr: Used for tidying the data
stringr: Used for string manipulation
ggplot2: Used for data visualization
printr: Used to print R objects to appropriate formats
DT: Used to display R objects with DataTables
knitr: Used to display an aligned table on the screen

# Loading the libraries
library(flexdashboard)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(printr)
library(DT)
library(knitr)

Data Preparation

1.Data Import

Source: Commercial bank failures, 1934 - Present from Kaggle

This dataset contains 12 variables and 3484 observations of the failure of a commercial bank, savings association, and savings bank since the establishment of the FDIC in 1933. The bank failure report is available on the FDIC website. Comprehensive data on estimated losses are not available for FDIC-insured failures prior to 1986. Estimated loss is presented as “N/A” in years for which comprehensive information is not available.

# URL to the github repository
url <- "https://raw.githubusercontent.com/sayalidw93/Data-Wrangling/master/project%20data.csv"
# Read csv file
bank_raw <- read.csv(url, stringsAsFactors = FALSE)

# Display class of the dataset
class(bank_raw)

[1] "data.frame"

# Display variable names in the dataset
colnames(bank_raw)

 [1] "Financial.Institution.Number" "Institution.Name"            
 [3] "Institution.Type"             "Charter.Type"                
 [5] "Headquarters"                 "Failure.Date"                
 [7] "Insurance.Fund"               "Certificate.Number"          
 [9] "Transaction.Type"             "Total.Deposits"              
[11] "Total.Assets"                 "Estimated.Loss..2015."

# Display the dimensions of the dataset
dim(bank_raw)

[1] 3484   12

2. Codebook

Codebook is available on the FDIC website. Description of the variables in the dataset is as follows:

Financial Institution Number - Financial Institution Number is a unique number assigned to the institution as an Assistance Agreement, Conservatorship, Bridge Bank or Receivership.
Institution Name - This is the legal name of the institution.
Institution Type - Commericial Bank, Savings association and Savings Bank.
Charter Type - Legal authorization to conduct business granted by federal or state government.
Headquarters - The city and state of the headquarters of location.
Failure Date - The date that the failed institution ceased to exist as a privately held going concern.
Insurance Fund - Insurance funds used for insuring the depositors.
Certificate Number - The certificate number assigned by the FDIC used to identify institutions and for the issuance of insurance certificates.
Transaction Type - Transaction types used to resolve the institution failures.
Total Deposits - Total deposits to the institution which are filed by the institution prior to the failure date (includes demand deposits, money market deposits, other savings deposits, time deposits and deposits in foreign offices).
Total Assets - The total assets owned and filed by the institution prior to the failure date (includes cash, loans, securities, bank premises and other assets).
Estimated Loss(2015) - The estimated loss is the difference between the amount disbursed from the Deposit Insurance Fund to cover obligations to insured depositors and the amount estimated to be ultimately recovered from the liquidation of the receivership estate.

3. Data Cleaning

Missing values in the data:

I checked the data for missing values.

# Number of missing values for each variable
colSums(is.na(bank_raw))

Financial.Institution.Number             Institution.Name 
                         601                            0 
            Institution.Type                 Charter.Type 
                           0                            0 
                Headquarters                 Failure.Date 
                           0                            0 
              Insurance.Fund           Certificate.Number 
                           0                          485 
            Transaction.Type               Total.Deposits 
                           0                            0 
                Total.Assets        Estimated.Loss..2015. 
                         151                          975

Thus, there are missing values for Financial Institution Number, Certificate Number, Total Assets and Estimated Losses. As the variables Financial Institution Number and Certificate Number are not much of a concern in the exploratory analysis for this dataset, I will drop these two variables. However, for Total Assets and Estimated Losses I have decided not to remove any rows with missing data yet as these variables are significant for analysis. I will take care of this issue while doing individual analysis.

# Dropping insignificant variables
bank_raw <- bank_raw %>% select(-Financial.Institution.Number,-Certificate.Number)

Splitting Failure Date and Headquarter variables

For the ease of analysis and to find relation between the variables, I split the Headquarters variable into Headquarter city and State variables, and the Failure Date variable into Month, Date and Year variables.

# Splitting the Headquarter and Failure Date variables
banks <- bank_raw %>%
        separate(Headquarters, c("Headquarter city","State"), sep = ",") %>%
        separate(Failure.Date, c("Month","Date","Year"), sep = "/")

Imputing data

In the process of splitting the variables, a warning popped up for 2914th observation.

# Checking the record after the variables were split
banks[2914,]

	Institution.Name	Institution.Type	Charter.Type	Headquarter city	State	Month	Date	Year	Insurance.Fund	Transaction.Type	Total.Deposits	Total.Assets	Estimated.Loss..2015.
2914	BANK USA, N.A.	COMMERCIAL BANK	FEDERAL	KIHEI	MAUI	5	19	1995	BIF	ACQUISITION	9124	9048	2593

# Checking the record before the variables were split
bank_raw[2914,]

	Institution.Name	Institution.Type	Charter.Type	Headquarters	Failure.Date	Insurance.Fund	Transaction.Type	Total.Deposits	Total.Assets	Estimated.Loss..2015.
2914	BANK USA, N.A.	COMMERCIAL BANK	FEDERAL	KIHEI, MAUI, HI	5/19/1995	BIF	ACQUISITION	9124	9048	2593

Thus, we see that, the state variable is MAUI instead of HI(Hawaii). Comparing it with the original dataset, I decided to replace the Value “MAUI” with “HI” for this record, as the state variable is an important variable in our analysis.

# Imputing value in the State variable
banks$State <- str_replace(banks$State, "MAUI","HI")

4. Data Preview

In the data, each row is the failed bank and the columns represent the information for each failed bank. Dollar amount is in thousands for Total Assets, Total Deposits amd Estimated Losses.

# Get a sense of cleaned data
datatable(tail(banks,100))

5. Structure and Summary of Clean Data

# Create a datatable to display structure of clean data
Var.type <- lapply(banks,class)
Var.name <- colnames(banks)
strdata <- as_data_frame(cbind(Var.name,Var.type))
colnames(strdata) <- c("Variable Name","Variable Type")
kable(strdata)

Variable Name	Variable Type
Institution.Name	character
Institution.Type	character
Charter.Type	character
Headquarter city	character
State	character
Month	character
Date	character
Year	character
Insurance.Fund	character
Transaction.Type	character
Total.Deposits	integer
Total.Assets	integer
Estimated.Loss..2015.	integer

# Summary Statistics
Total.Deposits <- summary(banks$Total.Deposits)
Total.Assets <- summary(banks$Total.Assets)
Estimated.Loss <- summary(banks$Estimated.Loss..2015.)
# Merging the summary statistics for 3 variables in a data frame
summary_stats <- rbind(Total.Deposits,Total.Assets,Estimated.Loss)
# Display the summary statistics table
kable(summary_stats)

	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.	NA’s
Total.Deposits	0	12293	41721	299776.18	156241.5	188260793	0
Total.Assets	14	15051	47523	404588.41	173478.0	307021614	151
Estimated.Loss	-4386	3852	13354	70644.35	44585.0	12438005	975

The summary statistics table gives us an idea about Mean, Median, Minimum, Maximum, Missing values, 1st Quartile and 3rd Quartile for the numeric variables of interest. The negative Minimum value of Estimated loss may indicate that those banks had a slight margin for profit, however going forward estimated to run out of money in the near future, hence opted for early bankruptcy.Thus, those are not unusual values and are not deleted.

Proposed Exploratory Data Analysis

This section aims to explore the various risk indicators for a bank failure.

I intend on exploring the following relations :

Analyze the bank failure rate over the period of time using a barplot/time series plot.This will help the depositor in deciding if the bank is risky or if the bank failed because of adverse market conditions.
Also, analyze the state-wise bank failure rate over the period of time using barplots, heatmaps etc. This will give an idea which states were more likely to witness bank failures.
Institution Type wise bank failure rate over the given period of time using barplot. This helps in deciding which institution types had a high failure rate.
From the data it looks like the difference between the total assets and total deposits is minimal. I plan to create a new Difference variable and try to check its variation over the given period of time. Also, a time series plot of the two variables, total assets and total deposits will give a clearer picture of how these variables has had an impact on bank failure. Theoretically, the bookish value of total assets (Total Assets) is more than actual value of total assets at the time when the bank is declared bankrupt.
Analyze the variation of estimated losses with respect to the transaction type used to resolve a bank failure.

What do I not know now? :(

Still exploring possible ways of building a model which gives the relationship of the estimated cost of resolving the bank failure with the other significant variables.