Overview

Financial wellness can be defined in different ways for the purposes of this analysis Financial wellness will be defined as an individuals relationship with money. This definition refers to how secure your money is, given all the variables involving an unknown future. Financial Education Council.

This is a broad definition but it includes several key concepts such as: - Preparation for emergencies - Active short term and long term plans and goals - Budgeting and attitudes towards money - Feelings towards your finances

Intuitively I understand the importance of financial wellness to an individuals immediate financial stability, their long-term aggregation of wealth and financial sustainability in retirement. These benefits are well documented, studied and highlighted by financial advisers and wealth management firms. However the impact of financial wellness on your emotional and mental health should not be ignored our understated. The stress, anxiety, depression, sleep deprivation or even lack of motivation that comes with poorly handled money matters will impact all aspects of your life.

When people are in control of their money, they’re in control of their lives. They’re happier, less anxious and more empowered. nudge

Few topics will have such a large impact on an individuals current and future life from both an economic and a mental health perspective. Key to developing financial wellness in the general population is understanding how to measure it and what factor impact it. The goal of this project is to explore the measurements of financial wellness and the factors that determine it. I have been interested in measures and predictors of financial wellness for several years.

For this project I have identified two data sets that share a common measure of financial wellbeing. The Consumer Financial Protection Bureau has developed a measurements of financial wellbeing called the Financial Wellbeing Scale. The Federal Reserve System utilizes this same measurement for financial wellbeing in their annual household decision maker survey.

Data Sources:

There are numerous studies of financial wellness sponsored by financial firms, industry groups and government agencies. I choose to focus on two studies:

Consumer Financial Protection Bureau - Financial Wellness Survey Financial well-being data The Public Use File (PUF) survey results can be accessed as a csv file Financial well-being survey data
Federal Reserve System - Survey of Household Economic and Decision making Federal Reserve Household Economic Decisions The SHED survey results can be accessed as a zip file Survey of Household Economics and Decisionmaking 2020

These are comprehensive studies that are conducted periodically using similar methodology. Both studies survey the population to gather demographic, financial and behavioral data from individuals. Both studies include the same measure of financial wellbeing that was defined and operationalized by the Consumer Financial Protection Bureau called the Financial Well-Being Scale. The Financial Well-Being Scale was created as part of the Consumer Financial Protection Bureau to encourage Financial Wellness.

The score is calculated using 10 questions survey:

How well does this statement describe you or your situation?

I could handle a major unexpected expense
I am securing my financial future
Because of my money situation, I feel like I will never have the things I want in life
I can enjoy life because of the way I’m managing my money
I am just getting by financially
I am concerned that the money I have or will save won’t last

How often does this statement apply to you?

Giving a gift for a wedding, birthday or other occasion would put a strain on my finances for the month*
I have money left over at the end of the month
I am behind with my finances
My finances control my life

Consumer Financial Protection Bureau - Financial Wellness Survey

The data was collected as part of the Consumer Financial Protection Bureau’s (CFPB) National Financial Well-Being Survey Public Use File (PUF). The PUF is a data set containing:

demographic data
assessment of financial knowledge
financial positions
financial products owned
financial well-being scale

The National Financial Wellbeing Survey was conducted in English and Spanish digitally / online between October 27, 2016 and December 5, 2016. Overall, 6,394 surveys were completed: 5,395 from the general population sample and 999 from an oversample of adults aged 62 and older. The survey was designed to represent the adult population of the 50 U.S. states and the District of Columbia. The survey was fielded on the GfK KnowledgePanel®. The KnowledgePanel sample is recruited using address-based sampling and dual-frame landline and cell phone random digit dialing methods.

The PUF was published in 2017.

Federal Reserve System - Survey of Household Economic and Decision making (SHED)

The data was collected as part of the Report on the Economic Well‐Being of U.S. Households in 2020,and examines the economic well-being and financial lives of U.S. adults and their families.

The SHED includes:

demographic data
assessment of financial knowledge
financial products owned
financial positions
economic fragility
financial well-being scale

The SHED is sponsored by the Board of Governors of the Federal Reserve System. The surveys that generate the data is collected by Ipsos using their online probability based KnowledgePanel. The 2020 survey of more than 11,000 adults was conducted in November 2020, offering a picture of how people were faring eight months after the onset of the COVID-19 pandemic.

The SHED was published in 2020.

Setup

set.seed(1234)
rm(list=ls())

# Exported data load, tidy function and tibble cleaning
fun_file_name <- glue(getwd(), "/ProjectFinal/DATA607_Functions.R")
source(fun_file_name, local = knitr::knit_global())

Workflow

This project is based on the OSEMN analysis Model. The OSEMN model is defined by the following steps:

Obtain - Source a sufficient corpus of usable data to conduct the desired analysis
Scrub - Preparing the data for further analysis by restructuring, cleaning, and filtering the data set
Explore - Conduct an initial exploration of the data to better understand the cases and the variables
Model - Fit the model to the data once it has been cleaned, explored and properly formatted
iNterpret - Explore the output of the model to understand the accuracy

OSEMN analysis Model

Obtain

The data for the analysis was obtained from the following sources:

Consumer Financial Protection Bureau - Financial Wellness Survey: The PUF survey results can be accessed as a csv file Financial well-being survey data
Federal Reserve System - Survey of Household Economic and Decision making: The SHED survey results can be accesssed as a zip file Survey of Household Economics and Decisionmaking 2020

# load raw data
cfpbRaw_df <- getRawCFPBFile()
fedRaw_df <- getRawFedFile()


# calculate sample stats for each survey
s_df <- tibble("name"=character(), "vars"=double(), "obs"=double(), "cfpb_mean"=double(),
               "cfpb_median"=double(), "cfpb_std"=double() )
s_df <- s_df %>% add_row(name="cfpb", vars=ncol(cfpbRaw_df) ,obs=nrow(cfpbRaw_df),
                        cfpb_mean=round(mean(cfpbRaw_df$FWBscore),2), 
                        cfpb_median=round(median(cfpbRaw_df$FWBscore),2), 
                        cfpb_std=round(sd(cfpbRaw_df$FWBscore),2))
s_df <- s_df %>% add_row(name="fed", vars=ncol(fedRaw_df) ,obs=nrow(fedRaw_df),
                        cfpb_mean=round(mean(fedRaw_df$CFPB_score),2), 
                        cfpb_median=round(median(fedRaw_df$CFPB_score),2), 
                        cfpb_std=round(sd(fedRaw_df$CFPB_score),2) )

s_df %>%
  kbl() %>%
  column_spec(1, bold = T, border_right = T) %>%
  kable_paper(bootstrap_options = "striped", fixed_thead = T,full_width = F)

name	vars	obs	cfpb_mean	cfpb_median	cfpb_std
cfpb	217	6394	56.03	56	14.15
fed	372	11648	56.32	55	14.85

# cfpb
ggplot(cfpbRaw_df,
    aes(FWBscore, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(FWBscore)), linetype = "dashed", color="red", size = 0.5) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (cfpb dataset)")

summary(cfpbRaw_df$FWBscore)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -4.00   48.00   56.00   56.03   65.00   95.00

# fed
ggplot(fedRaw_df,
    aes(CFPB_score, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(CFPB_score)), linetype = "dashed", color="red", size = 0.5) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (federal reserve data set)")

summary(fedRaw_df$CFPB_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   47.00   55.00   56.32   66.00   90.00

The cfpb scores from the Federal Consumer Protection Bureau and the Federal Reserve System appear to follow a normal distribution with similar sample statistics for mean median and standard deviation.

observations

The Financial Wellbeing Scaled score has consistent stats across both surveys
The Survey’s are comprehensive and will require filtering and variable selection to create a manageable data set of cases and variables.
Most of the variables are categorical
The CFPB survey data is coded based on the survey responses. To be usable the data can be converted to factor variables that capture the survey code and the definition
The survey response data requires the associated code books to decipher the variables in the file.

Scrub

The scrub step in the process leverages the two code books for the survey to filter, translate and rename the variables.

Variable Selection - To identifying a subset of the variable to use in the analysis I relied on existing theory, survey findings and variables that I was interested in exploring. Additional / different variables could be used in subsequent analysis to explore other theories.
Transforming - To transform the survey data I relied on the code books for each surveys to better understand the attribute and response coding. The Fed survey results included full text answer in a non UTF-8 format. The data was transformed to a UTF-8 format and truncated for readability.
Filtering - To reduce the compute time and make the analysis manageable I took a random subset of the data ensuring that preserved the breakdown of cfpb score categories from the full data set.
Outliers - there are a small number of records with a negative fin well scale score. Given the methodology for calculating the score these observations are anomalies and were deleted from the data set.

# get cfpb file
cfpb_df <- getCFPBFile()

## The following `from` values were not present in `x`: -1, 1, 2, 3, 4, 5, 6, 7, 98, 99

cfpb_df <- cfpb_df %>% filter( cfpb_score > 0)

cfpb_df$cfpb_score_4cat <- cut(cfpb_df$cfpb_score, breaks = c(-10, 40, 60, 80, 100),
                           labels = c("< 40","40-60","60-80","80-100"),
                           right = FALSE,
                           include.lowest=TRUE) 
s_df <- tibble("cfpb" = levels(count(cfpb_df$cfpb_score_4cat)$x))
s_df <- add_column(s_df,"cfpb_freq" = count(cfpb_df$cfpb_score_4cat)$freq)
cfpb_df <- cfpb_df %>% filter(cfpb_score >= 0)

# reduce cfpb dataset
cfpb_df <- slice_sample(cfpb_df, weight_by=cfpb_score_4cat ,n=4000) 
s_df <- add_column(s_df,"cfpb_filter" = levels(count(cfpb_df$cfpb_score_4cat)$x))
s_df <- add_column(s_df,"cfpb_filter_freq" = count(cfpb_df$cfpb_score_4cat)$freq)


# get fed file
fed_df <- getFedFile()
fed_df$cfpb_score_4cat <- cut(fed_df$cfpb_score, breaks = c(-10, 40, 60, 80, 100),
                           labels = c("< 40","40-60","60-80","80-100"),
                           right = FALSE,
                           include.lowest=TRUE) 
s_df <- add_column(s_df,"fed" = levels(count(fed_df$cfpb_score_4cat)$x))
s_df <- add_column(s_df,"fed_freq" = count(fed_df$cfpb_score_4cat)$freq)

# reduce cfpb dataset
fed_df <- slice_sample(fed_df, weight_by=cfpb_score_4cat ,n=4000) 
s_df <- add_column(s_df,"fed_filter" = levels(count(fed_df$cfpb_score_4cat)$x))
s_df <- add_column(s_df,"fed_filter_freq" = count(fed_df$cfpb_score_4cat)$freq)

The survey results were sampled using a consistent weight for cfpb score. The resulting dataset includes 4000 surveys from each survey data set.

s_df   %>% kbl() %>%
  kable_paper(bootstrap_options = "striped", fixed_thead = T,full_width = F)

cfpb	cfpb_freq	cfpb_filter	cfpb_filter_freq	fed	fed_freq	fed_filter	fed_filter_freq
< 40	728	< 40	273	< 40	1349	< 40	230
40-60	3152	40-60	1849	40-60	5769	40-60	1775
60-80	2186	60-80	1599	60-80	3566	60-80	1504
80-100	323	80-100	279	80-100	964	80-100	491

dim(cfpb_df)

## [1] 4000   24

dim(fed_df)

## [1] 4000   43

# cfpb
ggplot(cfpb_df,
    aes(cfpb_score, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", color="red", size = 0.5) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (cfpb dataset)")

summary(cfpb_df$cfpb_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   50.00   59.00   58.58   67.00   95.00

# fed
ggplot(fed_df,
    aes(cfpb_score, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", color="red", size = 0.5) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (federal reserve data set)")

summary(fed_df$cfpb_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   50.00   59.00   60.05   70.00   90.00

observations

Sampling - the sampling of the two survey data sets did not impact the distribution of the cfpb score in either data set.
Outliers - the data includes outlier score below 20 and above 90. Since this distribution is likely a reflection of the population these records were retained in the data set.

Explore

The survey data is comprehensive with 217 variables in the CFPB Survey and 372 variables in the Federal Reserve survey. To reduce the solution space the survey data was pre-filtered to only include variables that I thought would be likely to contribute to the effectiveness of the model. A subset of the variable exploration is included in this document additional visualization and analysis can be found in the accompanied DATA607_Plot document and the DATA607_STEP document.

Consumer Financial Protection Bureau - Survey

econ_save_rate (SAVINGSRANGES) - How much money do you have in savings today…?
house_mortgage (MORTGAGE) - What do you owe on your home today?
age_8cat (agecat) - Age in 8 categories
econ_hh_income (PPINCIMP) - Household Income

Federal Reserve System - Survey

age_7cat (ppagecat) - Age in 7 Categories
econ_saving (ppfs0596) - Q22: What is the approximate total amount of your household’s savings and invest
econ_inc_4cat (inc_4cat_50k) - Income (I40) ‐ 4 cat ‐ <25,25‐49,50‐99,100+
econ_fin_ok (atleast_okay) - Doing okay financially
econ_pay_exp400 (pay_casheqv) - Would handle $400 expense with cash or equivalent
econ_skip_med (skip_medical) - Went without medical care

Consumer Financial Protection Bureau

From the initial exploration of the dependent variable, Financial Well-Being Scale (cfpb_score) follows the normal distribution with a cluster of results on the high and low end of scale. The median and the mean are similar and the the distribution appears to be symmetric. The structure of the distribution suggest that there might be an opportunity to explore the top range of the distribution and the lower end of the distribution separately. This seems consistent with how the poor and the wealthy experience and navigate the economy.

Each of the survey questions includes the option not to provide answer. This creates incomplete data sets for a small proportion of the survey responses. The records for these incomplete cases where not filtered out until the specific model required complete records.

#econ_save_rate + house_mortgage + age_8cat + econ_hh_income

cfpb_tmp_df <- cfpb_df %>% select(cfpb_score, econ_save_rate, house_mortgage, age_8cat, econ_hh_income)
summary(cfpb_tmp_df)

##    cfpb_score              econ_save_rate            house_mortgage
##  Min.   :14.00   $5,000-19,999    :708    Question not asked:1270  
##  1st Qu.:50.00   Prefer not to say:633    Refused           :  36  
##  Median :59.00   $75,000 or more  :609    Less than $50,000 :1040  
##  Mean   :58.58   $1,000-4,999     :567    $50,000-199,999   : 878  
##  3rd Qu.:67.00   $20,000-74,999   :564    $200,000 or more  : 384  
##  Max.   :95.00   $100-999         :397    I don't know      :  68  
##                  (Other)          :522    Prefer not to say : 324  
##     age_8cat                econ_hh_income
##  62-69  :684   $100,000 to $149,999:719   
##  25-34  :671   $75,000 to $99,999  :612   
##  45-54  :656   $150,000 or more    :599   
##  75+    :493   $60,000 to $74,999  :423   
##  35-44  :465   Less than $20,000   :371   
##  55-61  :460   $30,000 to $39,999  :358   
##  (Other):571   (Other)             :918

Observations

Low correlation across numeric values - There are limited numeric values in the survey data. The numeric data that exists is not heavily correlated. This is somewhat surprising given that at least 4 of the numeric values are defined to reflect financial knowledge or financial wellbeing. The only two variable that display a correlation above 0.5 are the cfpb score for scaled financial wellness (FWBscore) and the Lusardi and Mitchell financial knowledge Summative scale scale scorescore (LMscore).
Correlation varies by education and income level - Even though the overall correlation between numeric variables is low there is a difference in correlation across education levels and income groups. Segmenting the population using these variables is an avenue for further analysis.

ggscatmat(data=cfpb_df, corMethod = "spearman", alpha=0.2)

## Warning in ggscatmat(data = cfpb_df, corMethod = "spearman", alpha = 0.2):
## Factor variables are omitted in plot

ggscatmat(data=cfpb_df, color="edu_level", corMethod = "spearman", alpha=0.2)

## Warning in ggscatmat(data = cfpb_df, color = "edu_level", corMethod =
## "spearman", : Factor variables are omitted in plot

ggscatmat(data=cfpb_df, color="econ_hh_income", corMethod = "spearman", alpha=0.2)

## Warning in ggscatmat(data = cfpb_df, color = "econ_hh_income", corMethod =
## "spearman", : Factor variables are omitted in plot

explore specific variables

The top categories for house values, morgage and income start at $150,000, $200,000 and $150,000. The lowest categories for house value, mortgage and income include $150,000, $50,000 and $20,000. The lowest categories and highest categories for these variables likely exhibit different behavior and economic engagement.

# econ_hh_income 
ggplot(cfpb_df, aes(x=econ_hh_income)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "econ_hh_income")

ggplot(data = cfpb_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=econ_hh_income, color=econ_hh_income)) +
  labs (title = "econ_hh_income")

ggplot(cfpb_df,
    aes(cfpb_score, fill = econ_hh_income, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by econ_hh_income)")

The initial review of the comparative visualizations for race and cfpb score seem to indicate that race may not be a primary driver of financial wellbeing. There is some correlation between race and wellness score but the strength of that relationship is not directly evident in the following graphs.

# race
ggplot(cfpb_df, aes(x=race)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "race")

ggplot(data = cfpb_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=race, color=race)) +
  labs (title = "race")

ggplot(cfpb_df,
    aes(cfpb_score, fill = race, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by race)")

The distribution of responses for the house mortgage survey question seem to reflect an unwillingness by survey respondents to share these details with the pollster. Just over 1700 survey respondents declined to answer the question.

# house_mortgage
ggplot(cfpb_df, aes(x=house_mortgage)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "house_mortgage")

ggplot(data = cfpb_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=house_mortgage, color=house_mortgage)) +
  labs (title = "house_mortgage")

ggplot(cfpb_df,
    aes(cfpb_score, fill = house_mortgage, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by house_mortgage)")

count(cfpb_df$house_mortgage)

##                    x freq
## 1 Question not asked 1270
## 2            Refused   36
## 3  Less than $50,000 1040
## 4    $50,000-199,999  878
## 5   $200,000 or more  384
## 6       I don't know   68
## 7  Prefer not to say  324

Federal Reserve System

The data set from the Federal Reserve survey includes several questions that produce variable with overlapping measures. This includes the variables that measure education level, employment, house hold composition and economic circumstances. As part of the modeling selection process I will look to rationalize the list of variables so that I do not include several variables that measure the same underlying characteristic. Similar to the survey data from the CFPB a portion of the respondent declined to answer specific questions. I will filter out incomplete cases as required by the model being used.

Observations

Few numeric variables with limited correlations. - similar to the CFPB survey data the Federal Reserve survey data reflects low correlation between the numeric variables in the data set. It should also be noted that the FED data set includes very few unique numeric variable. The overwhelming proportion of survey questions are designed around answer that can be placed in categories. The numeric numbers that exist primarily focus on house hold composition.
survey data includes non complete records - the second phase of the analysis would be to identify smaller subset of the variable and review the impact of incomplete data sets and outliers.

# age_7cat + econ_saving + econ_inc_4cat + econ_fin_ok + 
#     econ_pay_exp400 + econ_skip_med


#fed_tmp_df <- fed_df %>% select(cfpb_score, econ_saving, econ_fin_ok, econ_pay_exp400, econ_skip_med, edu, credit_guess)
fed_tmp_df <- fed_df %>% select(cfpb_score, age_7cat, econ_saving, econ_inc_4cat, econ_fin_ok, econ_pay_exp400, econ_skip_med) #0.4675

summary(fed_tmp_df)

##    cfpb_score     age_7cat                econ_saving  
##  Min.   :19.00   18-24:252   Under $50,000      :1175  
##  1st Qu.:50.00   25-34:646                      : 796  
##  Median :59.00   35-44:557   $100,000 - $249,999: 455  
##  Mean   :60.05   45-54:528   $50,000 - $99,999  : 392  
##  3rd Qu.:70.00   55-64:845   $1,000,000 or more : 339  
##  Max.   :90.00   65-74:786   $250,000 - $499,999: 335  
##                  75+  :386   (Other)            : 508  
##            econ_inc_4cat   econ_fin_ok   econ_pay_exp400 econ_skip_med 
##  $100,000 or more :1419   No     : 669   No     :1001    No     :3272  
##  $25,000-$49,999  : 671   Refused:   4   Refused:  22    Refused:  20  
##  $50,000-$99,999  :1061   Yes    :3327   Yes    :2977    Yes    : 708  
##  Less than $25,000: 810                                                
##  Refused          :  39                                                
##                                                                        
##

The highest correlation in the numeric data is for children age 13 - 17 and over 18 years. These variables have a high correlations with cfpb score but they do not contributes substantially to the model .

ggscatmat(data=fed_df, corMethod = "spearman", alpha=0.2)

## Warning in ggscatmat(data = fed_df, corMethod = "spearman", alpha = 0.2): Factor
## variables are omitted in plot

ggscatmat(data=fed_df, color="race", corMethod = "spearman", alpha=0.2)

## Warning in ggscatmat(data = fed_df, color = "race", corMethod = "spearman", :
## Factor variables are omitted in plot

ggscatmat(data=fed_df, color="gender", corMethod = "spearman", alpha=0.2)

## Warning in ggscatmat(data = fed_df, color = "gender", corMethod = "spearman", :
## Factor variables are omitted in plot

ggscatmat(data=fed_df, color="age_7cat", corMethod = "spearman", alpha=0.2)

## Warning in ggscatmat(data = fed_df, color = "age_7cat", corMethod =
## "spearman", : Factor variables are omitted in plot

explore specific variables

The Economic circumstance variables (econ_fin_ok, econ_pay_exp400, econ_skip_med) are heavily represented in the federal reserve model. This is not surprising given the nature of the questions that define the cfpb score. It indicate the survey respondents are aware of their current financial wellness circumstance. As you can see from the grouped histogram there is distinct shift in the distribution upwards for survey respondents who feel they are on firmer footing financial. The alignment of the decision to forgo medical care based on on cfpb score highlights some real world impacts financial wellness. Finally it should be noted that there are cluster of outlier data at the high end and lower end of the distribution.

# econ_fin_ok
ggplot(fed_df, aes(x=econ_fin_ok)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "econ_fin_ok")

ggplot(data = fed_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=econ_fin_ok, color=econ_fin_ok)) +
  labs (title = "econ_fin_ok")

ggplot(fed_df,
    aes(cfpb_score, fill = econ_fin_ok, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by econ_fin_ok)")

# econ_skip_med
ggplot(fed_df, aes(x=econ_skip_med)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "econ_skip_med")

ggplot(fed_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=econ_skip_med, color=econ_skip_med)) +
  labs (title = "econ_skip_med")

ggplot(fed_df,
    aes(cfpb_score, fill = econ_skip_med, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by econ_skip_med)")

# econ_pay_exp400
ggplot(fed_df, aes(x=econ_pay_exp400)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "econ_pay_exp400")

ggplot(data = fed_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=econ_pay_exp400, color=econ_pay_exp400)) +
  labs (title = "econ_pay_exp400")

ggplot(fed_df,
    aes(cfpb_score, fill = econ_pay_exp400, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by econ_pay_exp400)")

Model

This analysis will focus on a linear regression model as the foundation of the conclusions. However one of the goals for this project was to explore different data science models. To these ends I will touch on a random forest model and neural network model that will utilize the same data set to generate a predictive model.

The analysis includes the following models:

linear regression model using parsnip packages from tidy models
random forest model using parsnip packages from tidy models
neural network model using the neuralnet package

Preconditions for Linear Models

The residual analysis from the CFPB model and Federal Reserve model roughly adheres to the following pre-conditions for linear modeling. More analysis would be required to ensure independence and a linear relationship to the dependent variable. It would be difficult to assert full independence of the predictor variables given that each case reflect an individual. Thre are numerouse studies that link demographics, race, geography and age to economic outcomes.

residuals are nearly normal
residuals have constant variability
residuals are independent
each variable is linearly related to the outcome

Model Definition and Variable Selection

The analysis started with a review of the survey code book to identify candidate variables for the analysis. Then relying on the findngs from each survey, and industry knowledge I conducted an initial variable selection. This selection process was used as a starting point, additional analysis could be conducted to refine the initial lists.

24 variables from the CFPB Survey.

house hold structure
housing economics and ownership characteristics
income
demographic
geographic
measures of financial knowledge and financial wellness

43 variable from the Fed Survey file were selected.

house hold structure
housing economics and ownership characteristics
income
demographic
geographic
measures of financial knowledge and financial wellness
assessment of econmic situation

As an extension of the variable exploration stepwise regression was used to optimize the model variables based on their contribution to the model accuracy. I started using manual stepwise variable elimination process but opted for automated approach to simplify the analysis. The two approaches converged on models with similar accuracy and complexity but included different variables.

I used three automated model selection methods from the tidyverse packages for stepwise regression, Akaike information criterion(AIC) and Bayesian information criterion(BIC). The three methods of stepwise regression generated several candidates for the final model. I selected 2 models that although did not have the highest r-squared seemed to provide a good balance between accuracy and simplicity.

Consumer Financial Protection Bureau

The final model for the CFPB survey data included

econ_save_rate
house_mortgage
age_8cat
econ_hh_income

# select independent variables
cfpb_df <- cfpb_df %>% select(cfpb_score, econ_save_rate, house_mortgage, age_8cat, econ_hh_income)


# split data
cfpb_split <- initial_split(cfpb_df, prop = 0.8, strata = cfpb_score)
cfpb_training <-  training(cfpb_split)
cfpb_test <-  testing(cfpb_split)

# Model Specification
lm_model <- linear_reg() %>% 
            set_engine('lm') %>% 
            set_mode('regression')


# Fitting to Trained Data
lm_fit <- lm_model %>% 
          fit(cfpb_score ~ ., data = cfpb_training)


# Explore Training Results
names(lm_fit)

## [1] "lvl"     "spec"    "fit"     "preproc" "elapsed"

summary(lm_fit$fit)

## 
## Call:
## stats::lm(formula = cfpb_score ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.025  -6.526  -0.458   6.167  44.964 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         49.6119     2.8403  17.467  < 2e-16 ***
## econ_save_rate0                     -7.2219     2.8054  -2.574 0.010090 *  
## econ_save_rate$1-99                 -8.5856     2.8186  -3.046 0.002338 ** 
## econ_save_rate$100-999              -4.4037     2.7546  -1.599 0.109991    
## econ_save_rate$1,000-4,999          -0.1939     2.7371  -0.071 0.943527    
## econ_save_rate$5,000-19,999          4.4495     2.7306   1.629 0.103307    
## econ_save_rate$20,000-74,999         7.9418     2.7419   2.897 0.003799 ** 
## econ_save_rate$75,000 or more       10.4604     2.7444   3.812 0.000141 ***
## econ_save_rateI don't know          -0.3468     2.9226  -0.119 0.905541    
## econ_save_ratePrefer not to say      4.9050     2.7363   1.793 0.073137 .  
## house_mortgageRefused                5.4563     2.0307   2.687 0.007249 ** 
## house_mortgageLess than $50,000      3.8809     0.5941   6.532 7.53e-11 ***
## house_mortgage$50,000-199,999        1.0324     0.5794   1.782 0.074881 .  
## house_mortgage$200,000 or more       1.0754     0.7584   1.418 0.156307    
## house_mortgageI don't know          -1.6690     1.4794  -1.128 0.259348    
## house_mortgagePrefer not to say      1.5256     0.8160   1.870 0.061640 .  
## age_8cat25-34                       -0.6687     0.8851  -0.756 0.450002    
## age_8cat35-44                       -1.0433     0.9606  -1.086 0.277519    
## age_8cat45-54                       -2.2337     0.9320  -2.397 0.016602 *  
## age_8cat55-61                       -1.3542     0.9748  -1.389 0.164860    
## age_8cat62-69                        2.7485     0.9537   2.882 0.003980 ** 
## age_8cat70-74                        4.4969     1.0607   4.240 2.30e-05 ***
## age_8cat75+                          5.2595     1.0037   5.240 1.71e-07 ***
## econ_hh_income$20,000 to $29,999    -1.2547     0.9357  -1.341 0.180068    
## econ_hh_income$30,000 to $39,999     0.8115     0.8996   0.902 0.367089    
## econ_hh_income$40,000 to $49,999     2.2390     0.9469   2.365 0.018111 *  
## econ_hh_income$50,000 to $59,999     3.1783     0.9519   3.339 0.000850 ***
## econ_hh_income$60,000 to $74,999     3.3230     0.9138   3.636 0.000281 ***
## econ_hh_income$75,000 to $99,999     4.6582     0.8485   5.490 4.34e-08 ***
## econ_hh_income$100,000 to $149,999   6.0806     0.8465   7.184 8.43e-13 ***
## econ_hh_income$150,000 or more       7.5160     0.8938   8.409  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.37 on 3168 degrees of freedom
## Multiple R-squared:  0.4173, Adjusted R-squared:  0.4118 
## F-statistic: 75.64 on 30 and 3168 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2)) 
plot(lm_fit$fit, pch = 16, col = '#006EA1')

# Data frame of estimated coefficients
tidy(lm_fit)

## # A tibble: 31 × 5
##    term                            estimate std.error statistic  p.value
##    <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                       49.6        2.84   17.5    2.66e-65
##  2 econ_save_rate0                   -7.22       2.81   -2.57   1.01e- 2
##  3 econ_save_rate$1-99               -8.59       2.82   -3.05   2.34e- 3
##  4 econ_save_rate$100-999            -4.40       2.75   -1.60   1.10e- 1
##  5 econ_save_rate$1,000-4,999        -0.194      2.74   -0.0708 9.44e- 1
##  6 econ_save_rate$5,000-19,999        4.45       2.73    1.63   1.03e- 1
##  7 econ_save_rate$20,000-74,999       7.94       2.74    2.90   3.80e- 3
##  8 econ_save_rate$75,000 or more     10.5        2.74    3.81   1.41e- 4
##  9 econ_save_rateI don't know        -0.347      2.92   -0.119  9.06e- 1
## 10 econ_save_ratePrefer not to say    4.90       2.74    1.79   7.31e- 2
## # … with 21 more rows

# Performance metrics on training data
glance(lm_fit)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.417         0.412  10.4      75.6       0    30 -12006. 24075. 24270.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# variable importance
vip(lm_fit, num_features=15)

#predict(lm_fit, new_data = cfpb_test)

cfpb_results <- predict(lm_fit, new_data = cfpb_test) %>% 
                            bind_cols(cfpb_test)

#cfpb_results

# RMSE on test set
rmse(cfpb_results, truth = cfpb_score, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        10.6

rsq(cfpb_results, truth = cfpb_score, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.427

ggplot(data = cfpb_results,
       mapping = aes(x = .pred, y = cfpb_score)) +
  geom_point(color = '#006EA1') +
  geom_smooth(method = lm) +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression Results CFPB data set)',
       x = 'Predicted',
       y = 'Actual')

## `geom_smooth()` using formula 'y ~ x'

Random Forest

Additional analysis was conduct by creating a random forest machine learning model using parsnip from the tidy models package. The model was based on the varaiables selected in the previous linear regression model. The full model is available in the DATA607_ML.Rmd file. The resulting model is a categorical model with 4 primary categories for the cfpb score. Even with these broad categories the model reports an accuracy level of 0.505. The confusion matrix below depicts that model performance. The model had difficulties predicting cfpb scores values below 40 and above 80.

Confusion Matrix This was an interesting exploration however the linear regression model seems more usable for this particular problem set.

Neural Network

Further analysis was conducted by creating a neural network using the neuralnet package. The resulting neural network with associated weights is depicted below.

Neural Network

The graph below depicts the mapping of actual cfpb scores to the predicted cfpb scores generated by the neural network.

Neural Network

Federal Reserve

The final model for the CFPB survey data included - age_7cat - econ_saving - econ_inc_4cat - econ_fin_ok - econ_pay_exp400 - econ_skip_med

# select independent variables
#fed_df <- fed_df %>% select(cfpb_score, econ_saving, econ_fin_ok, econ_pay_exp400, econ_skip_med, edu, credit_guess) #0.4379 

fed_df <- fed_df %>% select(cfpb_score, age, age_7cat, econ_saving, econ_inc_4cat, econ_fin_ok, econ_pay_exp400, econ_skip_med) #0.4675

# split data
fed_split <- initial_split(fed_df, prop = 0.8, strata = cfpb_score)
fed_training <- training(fed_split)
fed_test <- testing(fed_split)

# Fitting to Trained Data
lm_fit <- lm_model %>% 
          fit(cfpb_score ~ ., data = fed_training)


# Explore Training Results
names(lm_fit)

## [1] "lvl"     "spec"    "fit"     "preproc" "elapsed"

summary(lm_fit$fit)

## 
## Call:
## stats::lm(formula = cfpb_score ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.135  -6.788  -0.738   6.466  32.090 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    45.45928    1.78738  25.433  < 2e-16 ***
## age                             0.15135    0.06205   2.439  0.01477 *  
## age_7cat25-34                  -3.21769    1.00635  -3.197  0.00140 ** 
## age_7cat35-44                  -5.44628    1.43514  -3.795  0.00015 ***
## age_7cat45-54                  -5.78081    1.98014  -2.919  0.00353 ** 
## age_7cat55-64                  -6.39763    2.50266  -2.556  0.01062 *  
## age_7cat65-74                  -3.12727    3.07008  -1.019  0.30846    
## age_7cat75+                    -3.38188    3.71940  -0.909  0.36328    
## econ_saving$1,000,000 or more   7.32404    0.81005   9.042  < 2e-16 ***
## econ_saving$100,000 - $249,999  1.20829    0.69505   1.738  0.08223 .  
## econ_saving$250,000 - $499,999  1.69734    0.77751   2.183  0.02911 *  
## econ_saving$50,000 - $99,999    0.04141    0.72572   0.057  0.95450    
## econ_saving$500,000 - $999,999  3.39762    0.82441   4.121 3.86e-05 ***
## econ_savingNot sure             2.26184    1.18522   1.908  0.05643 .  
## econ_savingRefused              1.99110    1.41009   1.412  0.15804    
## econ_savingUnder $50,000       -1.64144    0.54485  -3.013  0.00261 ** 
## econ_inc_4cat$25,000-$49,999   -4.33619    0.62143  -6.978 3.64e-12 ***
## econ_inc_4cat$50,000-$99,999   -2.34490    0.50138  -4.677 3.03e-06 ***
## econ_inc_4catLess than $25,000 -3.31709    0.64475  -5.145 2.84e-07 ***
## econ_inc_4catRefused           -3.25108    2.20784  -1.473  0.14098    
## econ_fin_okRefused             15.95144    5.23388   3.048  0.00232 ** 
## econ_fin_okYes                 10.94248    0.60587  18.061  < 2e-16 ***
## econ_pay_exp400Refused          4.54400    2.63361   1.725  0.08456 .  
## econ_pay_exp400Yes              5.75951    0.52473  10.976  < 2e-16 ***
## econ_skip_medRefused           -1.36292    2.60536  -0.523  0.60093    
## econ_skip_medYes               -6.58096    0.53583 -12.282  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.38 on 3173 degrees of freedom
## Multiple R-squared:  0.4849, Adjusted R-squared:  0.4808 
## F-statistic: 119.5 on 25 and 3173 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2)) 
plot(lm_fit$fit, pch = 16, col = '#006EA1')

# Data frame of estimated coefficients
tidy(lm_fit)

## # A tibble: 26 × 5
##    term                           estimate std.error statistic   p.value
##    <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                      45.5      1.79      25.4   5.01e-130
##  2 age                               0.151    0.0620     2.44  1.48e-  2
##  3 age_7cat25-34                    -3.22     1.01      -3.20  1.40e-  3
##  4 age_7cat35-44                    -5.45     1.44      -3.79  1.50e-  4
##  5 age_7cat45-54                    -5.78     1.98      -2.92  3.53e-  3
##  6 age_7cat55-64                    -6.40     2.50      -2.56  1.06e-  2
##  7 age_7cat65-74                    -3.13     3.07      -1.02  3.08e-  1
##  8 age_7cat75+                      -3.38     3.72      -0.909 3.63e-  1
##  9 econ_saving$1,000,000 or more     7.32     0.810      9.04  2.63e- 19
## 10 econ_saving$100,000 - $249,999    1.21     0.695      1.74  8.22e-  2
## # … with 16 more rows

# Performance metrics on training data
glance(lm_fit)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.485         0.481  10.4      119.       0    25 -12011. 24076. 24240.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# variable importance
vip(lm_fit, num_features=15)

#predict(lm_fit, new_data = fed_test)

fed_results <- predict(lm_fit, new_data = fed_test) %>% 
                            bind_cols(fed_test)

fed_results

## # A tibble: 801 × 9
##    .pred cfpb_score   age age_7cat econ_saving       econ_inc_4cat   econ_fin_ok
##    <dbl>      <int> <int> <fct>    <fct>             <fct>           <fct>      
##  1  64.4         51    45 45-54    "$100,000 - $249… $100,000 or mo… Yes        
##  2  45.7         41    43 35-44    ""                $100,000 or mo… No         
##  3  68.3         90    78 75+      "$50,000 - $99,9… $50,000-$99,999 Yes        
##  4  60.7         82    27 25-34    "$50,000 - $99,9… $50,000-$99,999 Yes        
##  5  42.4         40    55 55-64    "Under $50,000"   Less than $25,… No         
##  6  57.3         57    38 35-44    "Refused"         $100,000 or mo… Yes        
##  7  43.8         44    64 55-64    "Under $50,000"   Less than $25,… No         
##  8  77.4         70    73 65-74    "$1,000,000 or m… $100,000 or mo… Yes        
##  9  77.6         90    74 65-74    "$1,000,000 or m… $100,000 or mo… Yes        
## 10  59.5         64    64 55-64    "Under $50,000"   $25,000-$49,999 Yes        
## # … with 791 more rows, and 2 more variables: econ_pay_exp400 <fct>,
## #   econ_skip_med <fct>

# RMSE on test set
rmse(fed_results, truth = cfpb_score, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        10.6

rsq(fed_results, truth = cfpb_score, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.469

ggplot(data = fed_results,
       mapping = aes(x = .pred, y = cfpb_score)) +
  geom_point(color = '#006EA1') +
  geom_smooth(method = lm) +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression (Fed data set)',
       x = 'Predicted',
       y = 'Actual')

## `geom_smooth()` using formula 'y ~ x'

Random Forest

Additional analysis was conduct by creating a random forest machine learning model using parsnip from the tidy models package. The model was based on the variables selected in the previous linear regression model. The full model is available in the DATA607_ML.Rmd file. The resulting model is a categorical model with 4 primary categories for the cfpb score. Even with these broad categories the model reports an accuracy level of 0.468 The confusion matrix below depicts that model performance.

Confusion Matrix

This was an interesting exploration however the linear regression model seems more usable for this particular problem set.

Neural Network

Further analysis was conducted by creating a neural network using the neuralnet package. The resulting neural network with associated weights is depicted below.

The graph below depicts the mapping of actual cfpb scores to the predicted cfpb scores generated by the neural network.

Neural Network

iNterpret

The testing of the model indicates a clear relationship between the model variables and the observed cfpb score for the surveyed individuals. The residual graph indicates outliers at the low end and the higher end of the cfpb score. Studying those segments of the population might provide some additional insight into the relevance of the cfpb score to the respective populations.

The cfpb score is based on 10 questions that measure your financial wellbeing. The survey questions are self reported which includes some variability and bias. Self reporting also results in incomplete records as not every individual surveyed answers all of the questions.

Unsurprisingly variables representing savings rate, age and income appear in both models. This is consistent with the general consensus in wealth management. Most financial advisers will highlight the importance of both income and spending to your financial wellbeing but more importantly the impact of time. As important as these variable are to financial wellness they are not able to explain more than 30% or 40% of your wellness score.

It is interesting that race, gender, education, house hold composition and education levels are not part of the model. These attributes . The interrelation between race, gender and education with income is not included in this analysis but could help to explain their absence from the model. These variables could be impacting financial wellness through their impact on income.

Challenges

There were several challenges associated with this assignment including:

understanding the data - the survey data is well structured however each code book is 100s of pages describing the individual variables. Understanding the data, selecting the variables and filtering the dataset required a significant amount of analysis and additional reading of associated material from the CFPB and Federal Reserve.
transforming and tiddying the data for each model type - each model type used in the analysis requires descriptive data in a specific format, either numeric or categorical. The random forest and neural network models required some additional analysis to define meaningful categories.
model selection - efficiently filtering through 100s of survey variable to identify 4 or 5 variables that will construct the mdoel.

Conclusion

The linear regression model was more effective than the two random forest and neural network models. The random forest and neural network models struggled to differentiate cases at the high end and the low end of the cfpb score spectrum.

In reviewing the analysis what jumps out is the impact of early decisions on the overall outcome. The technical execution of the analysis is important but early decision on the data to include or exclude in the analysis has a large impact on the shape of the model. This analysis utilized several models with each model generating slightly different results.

I had not expected to resolve any long standing debates in the financial industry but it has been surprising how many different avenues for analysis or questions arose from this discussion.

DATA 607 Final Project: Predicting Financial Wellbeing

David Simbandumwe

Overview

Data Sources:

Consumer Financial Protection Bureau - Financial Wellness Survey

Federal Reserve System - Survey of Household Economic and Decision making (SHED)

Setup

Workflow

Obtain

Scrub

Explore

Consumer Financial Protection Bureau

Federal Reserve System

Model

Consumer Financial Protection Bureau

Random Forest

Neural Network

Federal Reserve

Random Forest

Neural Network

iNterpret

Challenges

Conclusion