Background:

I work as a branch bank manager and am responsible for the growth of deposit dollars at my location. I know that bank deposit information is publicly available, and I wondered how geography (and more specifically population) impacts deposit dollars at a given branch. I also knew that publicly available population data could be part of the inquiry.

BIG IDEA:

What is the correlation between branch bank deposits and the geographical region in which the bank is located, and does the population of that area have an impact on bank deposits for that region?

Data Import:

I began with banking data from FDIC.gov and census data from the Census Bureau. I imported ten years' worth of data covering deposits, locations, and various other measurements for all banks in the country. Next, I downloaded population estimates for each county in the U.S. over that same time period, with the goal of combining the two data sets.

Data Wrangling/Tidying

This step ended up being a lot more involved than originally anticipated.

Though the data was overall very tidy, there were many obstacles to overcome in combining the different data sets. I started by binding the different years into one data frame each for the FDIC data and the census data, respectively:

pop_estimates_all<-rbind(pop_estimates_2019,pop_estimates_2018,pop_estimates_2017,pop_estimates_2016,
                         pop_estimates_2015,pop_estimates_2014,pop_estimates_2014,pop_estimates_2013,
                         pop_estimates_2012,pop_estimates_2011,pop_estimates_2010)
# note: pop_estimates_2014 is bound twice in this call -- the consequences resurface in the 2014 totals later on

Slight variations in variable names in the Census Bureau data prevented the data sets from being combined. I ended up selecting the desired variables from each data set (by year) to determine which column headers did not match the other years, so that I could standardize them.
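As an illustration of the per-year fix, using a hypothetical mismatched header: if, say, the 2010 file had named its estimate column POPESTIMATE2010 instead of POPESTIMATE, the standardization would look like this:

library(dplyr)

pop_estimates_2010<-pop_estimates_2010%>%
  rename(POPESTIMATE=POPESTIMATE2010)%>%  # hypothetical header that differed by year
  select(CTYNAME,STNAME,COUNTY,POPESTIMATE,YEAR)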

And then bound the 10 data frames together:

clean_all<-rbind(clean_2019,clean_2018,clean_2017,clean_2016,clean_2015,clean_2014,clean_2013,clean_2012,clean_2011,clean_2010)

The census counts were inflated because population estimates were reported both by county and by state, and the two were difficult to tell apart at first. State-level rows carry a county code of 0, so I kept only the county rows:

pop_estimates_all<-pop_estimates_all%>%filter(COUNTY>0)

My goal was to merge the data sets, but the key variables had to match exactly before the merge could take place. This also proved difficult because of small variations in the way the data is organized:

pop_estimates_all<-pop_estimates_all%>%select("CTYNAME","STNAME","POPESTIMATE","YEAR")

pop_estimates_all<-pop_estimates_all%>%rename(COUNTY=CTYNAME)

One example of this: "Multnomah County" vs. "Multnomah". I used stringr's str_remove_all() to strip the " County" suffix so the county values matched:

pop_estimates_all$COUNTY<-str_remove_all(pop_estimates_all$COUNTY, " County")

clean_all<-clean_all%>%rename("COUNTY"="CNTYNAMB")

clean_all<-clean_all%>%rename("STNAME"="STNAMEBR")

With such a vast amount of data, I believed the sheer volume would work to my advantage when it came to having enough observations for sub-regions. It still proved challenging to visualize the data and identify outliers because of slight variations across the data.

Finally, after much trial and error, I was able to merge the data. Getting to this point took much longer than I had anticipated:

clean_combined<-merge(x=clean_all,y=pop_estimates_all) # merges on the shared keys: COUNTY, STNAME, and YEAR

Exploration of Data

Looking at the summary of DEPSUMBR (the branch deposit total), and at the five largest and five smallest observations, it was clear I would need to address the wide variation in deposit dollars across branches, both by excluding non-branch locations (which carry a zero balance) and by segmenting the data.

summary(clean_combined$DEPSUMBR)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##         0     19447     40103    110814     76365 407675546
clean_combined%>%arrange(desc(DEPSUMBR))%>%head(5)
##         STNAME YEAR    COUNTY               ADDRESBR BKCLASS      CITYBR
## 1     New York 2019  New York     390 Madison Avenue       N    New York
## 2 South Dakota 2019 Minnehaha 101 N. Phillips Avenue       N Sioux Falls
## 3     New York 2017  New York     401 Madison Avenue       N    New York
## 4     New York 2018  New York     401 Madison Avenue       N    New York
## 5     New York 2016  New York     401 Madison Avenue       N    New York
##       DEPSUM  DEPSUMBR METROBR MICROBR                                 NAMEBR
## 1          0 407675546       1       0                Madison Ave And 47th St
## 2 1291135000 353866534       1       0 Wells Fargo Bank, National Association
## 3          0 348521473       1       0             Madison And 48th St Branch
## 4          0 326155491       1       0             Madison And 48th St Branch
## 5          0 313686851       1       0             Madison And 48th St Branch
##                                    NAMEFULL SIMS_LATITUDE SIMS_LONGITUDE ZIPBR
## 1 JPMorgan Chase Bank, National Association      40.75606      -73.97697 10017
## 2    Wells Fargo Bank, National Association      43.54766      -96.72684 57104
## 3 JPMorgan Chase Bank, National Association      40.68580      -73.94230 10017
## 4 JPMorgan Chase Bank, National Association      40.68580      -73.94230 10017
## 5 JPMorgan Chase Bank, National Association      40.68580      -73.94230 10017
##   BRSERTYP POPESTIMATE
## 1       11     1628706
## 2       11      193134
## 3       11     1630698
## 4       11     1629055
## 5       11     1635443
clean_combined%>%arrange(DEPSUMBR)%>%head(5)
##    STNAME YEAR  COUNTY                   ADDRESBR BKCLASS      CITYBR DEPSUM
## 1 Alabama 2010 Autauga       744 East Main Street      SM  Prattville      0
## 2 Alabama 2010 Baldwin 1301 South Mckenzie Street      SM       Foley      0
## 3 Alabama 2010 Baldwin         615 Mcmeans Avenue      NM Bay Minette      0
## 4 Alabama 2010 Calhoun          900 Wilmer Avenue      SM    Anniston      0
## 5 Alabama 2010  Clarke            102 Cobb Street      NM  Grove Hill      0
##   DEPSUMBR METROBR MICROBR                   NAMEBR                   NAMEFULL
## 1        0       1       0    Prattville Rdt Branch               Regions Bank
## 2        0       1       0             Foley Branch               Regions Bank
## 3        0       1       0       Bay Minette Branch                United Bank
## 4        0       1       0 Anniston Main Rdt Branch               Regions Bank
## 5        0       0       0       Cobb Street Branch First United Security Bank
##   SIMS_LATITUDE SIMS_LONGITUDE ZIPBR BRSERTYP POPESTIMATE
## 1      32.46007      -86.45584 36067       23       54773
## 2      30.39211      -87.68327 36535       11      183112
## 3      30.88357      -87.78754 36507       11      183112
## 4      33.65574      -85.82848 36201       23      118408
## 5      31.70993      -87.77676 36451       23       25766

This data includes all banking locations, including non-branch mortgage offices and other administrative locations. I used the branch service type variable (BRSERTYP) to eliminate the observations that are not standard retail bank branches.

Codes that define the type of service the branch office provides:

- 11: Full Service, brick and mortar office
- 12: Full Service, retail office
- 13: Full Service, cyber office
- 21: Limited Service, administrative office
- 22: Limited Service, military facility
- 23: Limited Service, drive-through facility
- 24: Limited Service, loan production office
- 25: Limited Service, consumer credit office
- 26: Limited Service, contractual office
- 27: Limited Service, messenger office
- 28: Limited Service, retail office
- 29: Limited Service, mobile/seasonal office
- 30: Limited Service, trust office

ggplot(clean_combined,aes(BRSERTYP))+geom_bar()

There are about 34,000 observations with a branch deposit sum of zero, which equates to only about 3% of the total data. For the sake of data cleaning, I wanted to exclude these locations because a zero balance most likely indicates a branch that is in the process of opening or closing, or a data entry error.

clean_combined%>%filter(DEPSUMBR==0)%>%count()
##       n
## 1 34216

Luckily, this visualization revealed that there were not many non-retail locations in this data set, so I felt comfortable making the change.

clean_combined<-clean_combined%>%mutate(BKCLASS=as.factor(BKCLASS))

clean_combined_branches<-clean_combined%>%filter(BRSERTYP %in% c(11,12))%>%filter(DEPSUMBR>0) # %in% (not ==) tests each row against both codes without vector recycling

A distribution of the total deposits reveals a very significant number of outliers, even with a log transformation applied:

ggplot(clean_combined_branches,aes(x="",y=log10(DEPSUMBR)))+geom_boxplot()

I used the boxplot statistics to identify outliers and created an additional data set without them, to remove some of the variation:

outliers<-boxplot(clean_combined_branches$DEPSUMBR)$out

#Drop any observations whose DEPSUMBR value appears in the outlier set:

outlier_test1<-clean_combined_branches[-which(clean_combined_branches$DEPSUMBR %in% outliers),]

  
clean_combined_branches_nooutliers<-outlier_test1

ggplot(outlier_test1,aes(x="",y=DEPSUMBR))+geom_boxplot()

In order to accomplish a meaningful analysis, I would need to subset the data by deposits to adjust for the variation still present in branch deposits after removing the outliers. I also wanted to subset the states into different regions.

I used the ifelse function to create binary variables for branch deposits in $50 million increments. I did this for both data sets (outliers removed and outliers retained); a sketch of the approach follows.
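The bucketing code itself did not make it into the document. A minimal sketch for the first few increments, assuming DEPSUMBR is reported in thousands of dollars (so 50,000 = $50MM) and that bank_data builds on the outlier-removed set; the remaining columns through d450_500MM follow the same pattern:

bank_data<-clean_combined_branches_nooutliers%>%
  mutate(d0_50MM=ifelse(DEPSUMBR<50000,1,0),
         d50_100MM=ifelse(DEPSUMBR>=50000 & DEPSUMBR<100000,1,0),
         d100_150MM=ifelse(DEPSUMBR>=100000 & DEPSUMBR<150000,1,0))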

There is definitely a tradeoff here, as the highest branch deposits are much lower after removing the outliers, so I kept both data sets usable.

bank_data%>%arrange(desc(DEPSUMBR))%>%head(5)
##       STNAME YEAR      COUNTY                  ADDRESBR BKCLASS          CITYBR
## 1   Virginia 2014    Caroline     211 North Main Street      SM   Bowling Green
## 2 California 2019 Los Angeles      3500 East 7th Street       N      Long Beach
## 3 New Jersey 2015      Bergen           71 Union Avenue       N East Rutherford
## 4  Tennessee 2018    Hamilton 1969 Northpoint Boulevard      SM          Hixson
## 5 California 2014   San Diego  8222 Mira Mesa Boulevard       N       San Diego
##   DEPSUM DEPSUMBR METROBR MICROBR                      NAMEBR
## 1      0   164036       1       0        Bowling Green Branch
## 2      0   164033       1       0 3500 East 7th Street Branch
## 3      0   164033       1       0      East Rutherford Branch
## 4      0   164029       1       0       Northgate Mall Branch
## 5      0   164028       1       0            Mira Mesa Branch
##                                    NAMEFULL SIMS_LATITUDE SIMS_LONGITUDE ZIPBR
## 1                   Union First Market Bank      38.05221      -77.34801 22427
## 2            CIT Bank, National Association      33.77533     -118.15104 90804
## 3             TD Bank, National Association      40.82679      -74.09695  7073
## 4                             SunTrust Bank      35.13670      -85.23980 37343
## 5 JPMorgan Chase Bank, National Association      32.91274     -117.14599 92126
##   BRSERTYP POPESTIMATE d0_50MM d50_100MM d100_150MM d150_200MM d200_250MM
## 1       11       29741       1         0          0          0          0
## 2       11    10039107       1         0          0          0          0
## 3       11      926391       1         0          0          0          0
## 4       11      364293       1         0          0          0          0
## 5       11     3248877       1         0          0          0          0
##   d250_300MM d300_350MM d350_400MM d400_450MM d450_500MM
## 1          0          0          0          0          0
## 2          0          0          0          0          0
## 3          0          0          0          0          0
## 4          0          0          0          0          0
## 5          0          0          0          0          0

Turning to some visualizations, I summed total deposits across all banks by region and came up with these graphs:
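The step that assigns each state to a region is not shown above; here is a minimal sketch, assuming base R's built-in state.name/state.region mapping (Northeast, South, North Central, West), with DC and the territories falling out as NA:

bank_data<-bank_data%>%
  mutate(regions=state.region[match(STNAME,state.name)])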

ggplot(bank_data,aes(YEAR,DEPSUMBR))+geom_col()+facet_wrap(~regions)

2014 is significantly higher than the rest of the years. The likely culprit is the duplicated pop_estimates_2014 in the rbind() call near the top: after the merge, every 2014 branch row matches two identical population rows and is counted twice, doubling that year's totals. I removed 2014 from the data in order to have a more accurate reading.
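The removal step is a one-liner; a sketch, assuming bank_data1 is simply bank_data without 2014:

bank_data1<-bank_data%>%filter(YEAR!=2014)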

ggplot(bank_data1,aes(YEAR,DEPSUMBR))+geom_col()+facet_wrap(~regions)

I also faceted the total bank deposits by whether the branch is located in a core urban area of 50K people or more (METROBR), and by whether it is located in a core urban area of more than 10K but less than 50K (MICROBR):

ggplot(bank_data1,aes(YEAR,DEPSUMBR))+geom_col()+facet_wrap(~MICROBR)+ggtitle("sum of deposits in population area of greater than 10k but less than 50K")

ggplot(bank_data1,aes(YEAR,DEPSUMBR))+geom_col()+facet_wrap(~METROBR)+ggtitle("sum of deposits in population area of 50K or more")

This quick chart shows that the sum of bank deposits in higher-density population areas is significantly higher than in lower-density areas, which aligns with my hypothesis.

The chart above alludes to the idea that areas with high deposits are more densely populated. Before creating linear regression models, I wanted to create subsets of the county population estimates (POPESTIMATE) so that similarly sized counties fall into the same groupings.

summary(bank_data$POPESTIMATE)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      451    75790   309323   843334   904962 10105708

According to the US Census, counties are separated into groupings based on the following population sizes, so I subset the data to compare banks in similarly sized population areas:

- 0-4,999
- 5,000-19,999
- 20,000-99,999
- 100,000-499,999
- 500,000-999,999
- 1,000,000+

Working again with the full data set including outliers, I added population subsets in addition to the deposit subsets I created before.

I created a set of data frames that chunk the data into the different county population groupings outlined by the Census Bureau, as sketched below.
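The chunk that creates clean_combined_branches_popsegments is not shown above; a minimal sketch, assuming one indicator column per Census grouping, built on the outlier-included branch data:

clean_combined_branches_popsegments<-clean_combined_branches%>%
  mutate(pop_0_5=ifelse(POPESTIMATE<5000,1,0),
         pop_5_20=ifelse(POPESTIMATE>=5000 & POPESTIMATE<20000,1,0),
         pop_20_100=ifelse(POPESTIMATE>=20000 & POPESTIMATE<100000,1,0),
         pop_100_500=ifelse(POPESTIMATE>=100000 & POPESTIMATE<500000,1,0),
         pop_500_1000=ifelse(POPESTIMATE>=500000 & POPESTIMATE<1000000,1,0),
         pop_1000_up=ifelse(POPESTIMATE>=1000000,1,0))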

pop_0_5_data<-clean_combined_branches_popsegments%>%filter(pop_0_5==1)
pop_5_20_data<-clean_combined_branches_popsegments%>%filter(pop_5_20==1)
pop_20_100_data<-clean_combined_branches_popsegments%>%filter(pop_20_100==1)
pop_100_500_data<-clean_combined_branches_popsegments%>%filter(pop_100_500==1)
pop_500_1000_data<-clean_combined_branches_popsegments%>%filter(pop_500_1000==1)
pop_1000_up_data<-clean_combined_branches_popsegments%>%filter(pop_1000_up==1)

Attempting to Find a Predictive Model Using R^2

With many subsets of the original data in hand, I fit some basic linear regression models in pursuit of an R^2 value with some predictive power. Starting from the underlying idea of measuring the predictive power of population size on bank deposits, I began there and expanded the modeling by varying the independent variables.

The first bivariate model INCLUDED the outliers:

summary(lm(DEPSUMBR~POPESTIMATE,data=cleaned_combined_branches_segmented))
## 
## Call:
## lm(formula = DEPSUMBR ~ POPESTIMATE, data = cleaned_combined_branches_segmented)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
##   -299088    -85350    -66125    -33584 407546429 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.644e+04  3.507e+03   27.50   <2e-16 ***
## POPESTIMATE 2.006e-02  1.839e-03   10.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2109000 on 472371 degrees of freedom
## Multiple R-squared:  0.0002518,  Adjusted R-squared:  0.0002497 
## F-statistic:   119 on 1 and 472371 DF,  p-value: < 2.2e-16

This model suggests that the baseline branch holds about $96,440K ($96MM) in deposits, and that for every one-person increase in the population of a given area, total branch deposits increase on average by 2.006e-02, or about $20 (DEPSUMBR is reported in thousands of dollars). The R^2 value is VERY low, and it is safe to say that this cannot be used as a good predictive measure of overall branch deposits at each respective location.

The next model is another bivariate model, but it is adjusted to remove the outliers:

summary(lm(DEPSUMBR~POPESTIMATE, data=bank_data1))
## 
## Call:
## lm(formula = DEPSUMBR ~ POPESTIMATE, data = bank_data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83721 -26909  -8944  18742 118479 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.504e+04  2.062e+02   218.4   <2e-16 ***
## POPESTIMATE 3.839e-03  1.177e-04    32.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35950 on 39468 degrees of freedom
## Multiple R-squared:  0.02623,    Adjusted R-squared:  0.0262 
## F-statistic:  1063 on 1 and 39468 DF,  p-value: < 2.2e-16

The interpretation of this model is that baseline branch deposits are around $45,040K ($45MM), and that for every one-person increase in population there is approximately a $3.84 increase in overall branch deposits at each respective location (the coefficient 3.839e-03 is again in thousands of dollars).

This is obviously not a good fit, because we know that not every bank has exactly that amount of deposits, and because of the amount of variation in the data set and the number of contributing factors. The R^2 value implies that this model explains only about 2.6% of the variation in branch deposits.

I continued to add variables and work through the different subsets to find a model with better predictive value; since R^2 is only about 0.026 here, anything higher is a better fit.
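For instance, a sketch of one expanded model, adding the metro and micro indicators (METROBR, MICROBR) already present in the data as predictors:

summary(lm(DEPSUMBR~POPESTIMATE+METROBR+MICROBR,data=bank_data1))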

First, I ran regressions on the data frames that were subsetted by population; an example follows.
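A sketch, assuming the population subsets created earlier, fitting the bivariate model within the smallest and largest county groupings:

summary(lm(DEPSUMBR~POPESTIMATE,data=pop_0_5_data))
summary(lm(DEPSUMBR~POPESTIMATE,data=pop_1000_up_data))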

Conclusion and Insights
