Overview
Financial wellness can be defined in different ways for the purposes of this analysis Financial wellness will be defined as an individuals relationship with money. This definition refers to how secure your money is, given all the variables involving an unknown future. Financial Education Council.
This is a broad definition but it includes several key concepts such as: - Preparation for emergencies - Active short term and long term plans and goals - Budgeting and attitudes towards money - Feelings towards your finances
Intuitively I understand the importance of financial wellness to an individuals immediate financial stability, their long-term aggregation of wealth and financial sustainability in retirement. These benefits are well documented, studied and highlighted by financial advisers and wealth management firms. However the impact of financial wellness on your emotional and mental health should not be ignored our understated. The stress, anxiety, depression, sleep deprivation or even lack of motivation that comes with poorly handled money matters will impact all aspects of your life.
Few topics will have such a large impact on an individuals current and future life from both an economic and a mental health perspective. Key to developing financial wellness in the general population is understanding how to measure it and what factor impact it. The goal of this project is to explore the measurements of financial wellness and the factors that determine it. I have been interested in measures and predictors of financial wellness for several years.
For this project I have identified two data sets that share a common measure of financial wellbeing. The Consumer Financial Protection Bureau has developed a measurements of financial wellbeing called the Financial Wellbeing Scale. The Federal Reserve System utilizes this same measurement for financial wellbeing in their annual household decision maker survey.
Data Sources:
There are numerous studies of financial wellness sponsored by financial firms, industry groups and government agencies. I choose to focus on two studies:
Consumer Financial Protection Bureau - Financial Wellness Survey Financial well-being data The Public Use File (PUF) survey results can be accessed as a csv file Financial well-being survey data
Federal Reserve System - Survey of Household Economic and Decision making Federal Reserve Household Economic Decisions The SHED survey results can be accessed as a zip file Survey of Household Economics and Decisionmaking 2020
These are comprehensive studies that are conducted periodically using similar methodology. Both studies survey the population to gather demographic, financial and behavioral data from individuals. Both studies include the same measure of financial wellbeing that was defined and operationalized by the Consumer Financial Protection Bureau called the Financial Well-Being Scale. The Financial Well-Being Scale was created as part of the Consumer Financial Protection Bureau to encourage Financial Wellness.
The score is calculated using 10 questions survey:
How well does this statement describe you or your situation?
- I could handle a major unexpected expense
- I am securing my financial future
- Because of my money situation, I feel like I will never have the things I want in life
- I can enjoy life because of the way I’m managing my money
- I am just getting by financially
- I am concerned that the money I have or will save won’t last
How often does this statement apply to you?
- Giving a gift for a wedding, birthday or other occasion would put a strain on my finances for the month*
- I have money left over at the end of the month
- I am behind with my finances
- My finances control my life
Consumer Financial Protection Bureau - Financial Wellness Survey
The data was collected as part of the Consumer Financial Protection Bureau’s (CFPB) National Financial Well-Being Survey Public Use File (PUF). The PUF is a data set containing:
- demographic data
- assessment of financial knowledge
- financial positions
- financial products owned
- financial well-being scale
The National Financial Wellbeing Survey was conducted in English and Spanish digitally / online between October 27, 2016 and December 5, 2016. Overall, 6,394 surveys were completed: 5,395 from the general population sample and 999 from an oversample of adults aged 62 and older. The survey was designed to represent the adult population of the 50 U.S. states and the District of Columbia. The survey was fielded on the GfK KnowledgePanel®. The KnowledgePanel sample is recruited using address-based sampling and dual-frame landline and cell phone random digit dialing methods.
The PUF was published in 2017.
Federal Reserve System - Survey of Household Economic and Decision making (SHED)
The data was collected as part of the Report on the Economic Well‐Being of U.S. Households in 2020,and examines the economic well-being and financial lives of U.S. adults and their families.
The SHED includes:
- demographic data
- assessment of financial knowledge
- financial products owned
- financial positions
- economic fragility
- financial well-being scale
The SHED is sponsored by the Board of Governors of the Federal Reserve System. The surveys that generate the data is collected by Ipsos using their online probability based KnowledgePanel. The 2020 survey of more than 11,000 adults was conducted in November 2020, offering a picture of how people were faring eight months after the onset of the COVID-19 pandemic.
The SHED was published in 2020.
Setup
set.seed(1234)
rm(list=ls())# Exported data load, tidy function and tibble cleaning
fun_file_name <- glue(getwd(), "/ProjectFinal/DATA607_Functions.R")
source(fun_file_name, local = knitr::knit_global())Workflow
This project is based on the OSEMN analysis Model. The OSEMN model is defined by the following steps:
- Obtain - Source a sufficient corpus of usable data to conduct the desired analysis
- Scrub - Preparing the data for further analysis by restructuring, cleaning, and filtering the data set
- Explore - Conduct an initial exploration of the data to better understand the cases and the variables
- Model - Fit the model to the data once it has been cleaned, explored and properly formatted
- iNterpret - Explore the output of the model to understand the accuracy
OSEMN analysis Model
Obtain
The data for the analysis was obtained from the following sources:
Consumer Financial Protection Bureau - Financial Wellness Survey: The PUF survey results can be accessed as a csv file Financial well-being survey data
Federal Reserve System - Survey of Household Economic and Decision making: The SHED survey results can be accesssed as a zip file Survey of Household Economics and Decisionmaking 2020
# load raw data
cfpbRaw_df <- getRawCFPBFile()
fedRaw_df <- getRawFedFile()
# calculate sample stats for each survey
s_df <- tibble("name"=character(), "vars"=double(), "obs"=double(), "cfpb_mean"=double(),
"cfpb_median"=double(), "cfpb_std"=double() )
s_df <- s_df %>% add_row(name="cfpb", vars=ncol(cfpbRaw_df) ,obs=nrow(cfpbRaw_df),
cfpb_mean=round(mean(cfpbRaw_df$FWBscore),2),
cfpb_median=round(median(cfpbRaw_df$FWBscore),2),
cfpb_std=round(sd(cfpbRaw_df$FWBscore),2))
s_df <- s_df %>% add_row(name="fed", vars=ncol(fedRaw_df) ,obs=nrow(fedRaw_df),
cfpb_mean=round(mean(fedRaw_df$CFPB_score),2),
cfpb_median=round(median(fedRaw_df$CFPB_score),2),
cfpb_std=round(sd(fedRaw_df$CFPB_score),2) )
s_df %>%
kbl() %>%
column_spec(1, bold = T, border_right = T) %>%
kable_paper(bootstrap_options = "striped", fixed_thead = T,full_width = F)| name | vars | obs | cfpb_mean | cfpb_median | cfpb_std |
|---|---|---|---|---|---|
| cfpb | 217 | 6394 | 56.03 | 56 | 14.15 |
| fed | 372 | 11648 | 56.32 | 55 | 14.85 |
# cfpb
ggplot(cfpbRaw_df,
aes(FWBscore, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(FWBscore)), linetype = "dashed", color="red", size = 0.5) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (cfpb dataset)")summary(cfpbRaw_df$FWBscore)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.00 48.00 56.00 56.03 65.00 95.00
# fed
ggplot(fedRaw_df,
aes(CFPB_score, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(CFPB_score)), linetype = "dashed", color="red", size = 0.5) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (federal reserve data set)")summary(fedRaw_df$CFPB_score)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 47.00 55.00 56.32 66.00 90.00
The cfpb scores from the Federal Consumer Protection Bureau and the Federal Reserve System appear to follow a normal distribution with similar sample statistics for mean median and standard deviation.
observations
- The Financial Wellbeing Scaled score has consistent stats across both surveys
- The Survey’s are comprehensive and will require filtering and variable selection to create a manageable data set of cases and variables.
- Most of the variables are categorical
- The CFPB survey data is coded based on the survey responses. To be usable the data can be converted to factor variables that capture the survey code and the definition
- The survey response data requires the associated code books to decipher the variables in the file.
Scrub
The scrub step in the process leverages the two code books for the survey to filter, translate and rename the variables.
- Variable Selection - To identifying a subset of the variable to use in the analysis I relied on existing theory, survey findings and variables that I was interested in exploring. Additional / different variables could be used in subsequent analysis to explore other theories.
- Transforming - To transform the survey data I relied on the code books for each surveys to better understand the attribute and response coding. The Fed survey results included full text answer in a non UTF-8 format. The data was transformed to a UTF-8 format and truncated for readability.
- Filtering - To reduce the compute time and make the analysis manageable I took a random subset of the data ensuring that preserved the breakdown of cfpb score categories from the full data set.
- Outliers - there are a small number of records with a negative fin well scale score. Given the methodology for calculating the score these observations are anomalies and were deleted from the data set.
# get cfpb file
cfpb_df <- getCFPBFile()## The following `from` values were not present in `x`: -1, 1, 2, 3, 4, 5, 6, 7, 98, 99
cfpb_df <- cfpb_df %>% filter( cfpb_score > 0)
cfpb_df$cfpb_score_4cat <- cut(cfpb_df$cfpb_score, breaks = c(-10, 40, 60, 80, 100),
labels = c("< 40","40-60","60-80","80-100"),
right = FALSE,
include.lowest=TRUE)
s_df <- tibble("cfpb" = levels(count(cfpb_df$cfpb_score_4cat)$x))
s_df <- add_column(s_df,"cfpb_freq" = count(cfpb_df$cfpb_score_4cat)$freq)
cfpb_df <- cfpb_df %>% filter(cfpb_score >= 0)
# reduce cfpb dataset
cfpb_df <- slice_sample(cfpb_df, weight_by=cfpb_score_4cat ,n=4000)
s_df <- add_column(s_df,"cfpb_filter" = levels(count(cfpb_df$cfpb_score_4cat)$x))
s_df <- add_column(s_df,"cfpb_filter_freq" = count(cfpb_df$cfpb_score_4cat)$freq)
# get fed file
fed_df <- getFedFile()
fed_df$cfpb_score_4cat <- cut(fed_df$cfpb_score, breaks = c(-10, 40, 60, 80, 100),
labels = c("< 40","40-60","60-80","80-100"),
right = FALSE,
include.lowest=TRUE)
s_df <- add_column(s_df,"fed" = levels(count(fed_df$cfpb_score_4cat)$x))
s_df <- add_column(s_df,"fed_freq" = count(fed_df$cfpb_score_4cat)$freq)
# reduce cfpb dataset
fed_df <- slice_sample(fed_df, weight_by=cfpb_score_4cat ,n=4000)
s_df <- add_column(s_df,"fed_filter" = levels(count(fed_df$cfpb_score_4cat)$x))
s_df <- add_column(s_df,"fed_filter_freq" = count(fed_df$cfpb_score_4cat)$freq)The survey results were sampled using a consistent weight for cfpb score. The resulting dataset includes 4000 surveys from each survey data set.
s_df %>% kbl() %>%
kable_paper(bootstrap_options = "striped", fixed_thead = T,full_width = F)| cfpb | cfpb_freq | cfpb_filter | cfpb_filter_freq | fed | fed_freq | fed_filter | fed_filter_freq |
|---|---|---|---|---|---|---|---|
| < 40 | 728 | < 40 | 273 | < 40 | 1349 | < 40 | 230 |
| 40-60 | 3152 | 40-60 | 1849 | 40-60 | 5769 | 40-60 | 1775 |
| 60-80 | 2186 | 60-80 | 1599 | 60-80 | 3566 | 60-80 | 1504 |
| 80-100 | 323 | 80-100 | 279 | 80-100 | 964 | 80-100 | 491 |
dim(cfpb_df)## [1] 4000 24
dim(fed_df)## [1] 4000 43
# cfpb
ggplot(cfpb_df,
aes(cfpb_score, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", color="red", size = 0.5) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (cfpb dataset)")summary(cfpb_df$cfpb_score)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 50.00 59.00 58.58 67.00 95.00
# fed
ggplot(fed_df,
aes(cfpb_score, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", color="red", size = 0.5) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (federal reserve data set)")summary(fed_df$cfpb_score)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 50.00 59.00 60.05 70.00 90.00
observations
- Sampling - the sampling of the two survey data sets did not impact the distribution of the cfpb score in either data set.
- Outliers - the data includes outlier score below 20 and above 90. Since this distribution is likely a reflection of the population these records were retained in the data set.
Explore
The survey data is comprehensive with 217 variables in the CFPB Survey and 372 variables in the Federal Reserve survey. To reduce the solution space the survey data was pre-filtered to only include variables that I thought would be likely to contribute to the effectiveness of the model. A subset of the variable exploration is included in this document additional visualization and analysis can be found in the accompanied DATA607_Plot document and the DATA607_STEP document.
Consumer Financial Protection Bureau - Survey
- econ_save_rate (SAVINGSRANGES) - How much money do you have in savings today…?
- house_mortgage (MORTGAGE) - What do you owe on your home today?
- age_8cat (agecat) - Age in 8 categories
- econ_hh_income (PPINCIMP) - Household Income
Federal Reserve System - Survey
- age_7cat (ppagecat) - Age in 7 Categories
- econ_saving (ppfs0596) - Q22: What is the approximate total amount of your household’s savings and invest
- econ_inc_4cat (inc_4cat_50k) - Income (I40) ‐ 4 cat ‐ <25,25‐49,50‐99,100+
- econ_fin_ok (atleast_okay) - Doing okay financially
- econ_pay_exp400 (pay_casheqv) - Would handle $400 expense with cash or equivalent
- econ_skip_med (skip_medical) - Went without medical care
Consumer Financial Protection Bureau
From the initial exploration of the dependent variable, Financial Well-Being Scale (cfpb_score) follows the normal distribution with a cluster of results on the high and low end of scale. The median and the mean are similar and the the distribution appears to be symmetric. The structure of the distribution suggest that there might be an opportunity to explore the top range of the distribution and the lower end of the distribution separately. This seems consistent with how the poor and the wealthy experience and navigate the economy.
Each of the survey questions includes the option not to provide answer. This creates incomplete data sets for a small proportion of the survey responses. The records for these incomplete cases where not filtered out until the specific model required complete records.
#econ_save_rate + house_mortgage + age_8cat + econ_hh_income
cfpb_tmp_df <- cfpb_df %>% select(cfpb_score, econ_save_rate, house_mortgage, age_8cat, econ_hh_income)
summary(cfpb_tmp_df)## cfpb_score econ_save_rate house_mortgage
## Min. :14.00 $5,000-19,999 :708 Question not asked:1270
## 1st Qu.:50.00 Prefer not to say:633 Refused : 36
## Median :59.00 $75,000 or more :609 Less than $50,000 :1040
## Mean :58.58 $1,000-4,999 :567 $50,000-199,999 : 878
## 3rd Qu.:67.00 $20,000-74,999 :564 $200,000 or more : 384
## Max. :95.00 $100-999 :397 I don't know : 68
## (Other) :522 Prefer not to say : 324
## age_8cat econ_hh_income
## 62-69 :684 $100,000 to $149,999:719
## 25-34 :671 $75,000 to $99,999 :612
## 45-54 :656 $150,000 or more :599
## 75+ :493 $60,000 to $74,999 :423
## 35-44 :465 Less than $20,000 :371
## 55-61 :460 $30,000 to $39,999 :358
## (Other):571 (Other) :918
Observations
Low correlation across numeric values - There are limited numeric values in the survey data. The numeric data that exists is not heavily correlated. This is somewhat surprising given that at least 4 of the numeric values are defined to reflect financial knowledge or financial wellbeing. The only two variable that display a correlation above 0.5 are the cfpb score for scaled financial wellness (FWBscore) and the Lusardi and Mitchell financial knowledge Summative scale scale scorescore (LMscore).
Correlation varies by education and income level - Even though the overall correlation between numeric variables is low there is a difference in correlation across education levels and income groups. Segmenting the population using these variables is an avenue for further analysis.
ggscatmat(data=cfpb_df, corMethod = "spearman", alpha=0.2)## Warning in ggscatmat(data = cfpb_df, corMethod = "spearman", alpha = 0.2):
## Factor variables are omitted in plot
ggscatmat(data=cfpb_df, color="edu_level", corMethod = "spearman", alpha=0.2)## Warning in ggscatmat(data = cfpb_df, color = "edu_level", corMethod =
## "spearman", : Factor variables are omitted in plot
ggscatmat(data=cfpb_df, color="econ_hh_income", corMethod = "spearman", alpha=0.2)## Warning in ggscatmat(data = cfpb_df, color = "econ_hh_income", corMethod =
## "spearman", : Factor variables are omitted in plot
explore specific variables
The top categories for house values, morgage and income start at $150,000, $200,000 and $150,000. The lowest categories for house value, mortgage and income include $150,000, $50,000 and $20,000. The lowest categories and highest categories for these variables likely exhibit different behavior and economic engagement.
# econ_hh_income
ggplot(cfpb_df, aes(x=econ_hh_income)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "econ_hh_income")ggplot(data = cfpb_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=econ_hh_income, color=econ_hh_income)) +
labs (title = "econ_hh_income")ggplot(cfpb_df,
aes(cfpb_score, fill = econ_hh_income, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by econ_hh_income)")The initial review of the comparative visualizations for race and cfpb score seem to indicate that race may not be a primary driver of financial wellbeing. There is some correlation between race and wellness score but the strength of that relationship is not directly evident in the following graphs.
# race
ggplot(cfpb_df, aes(x=race)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "race")ggplot(data = cfpb_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=race, color=race)) +
labs (title = "race")ggplot(cfpb_df,
aes(cfpb_score, fill = race, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by race)")The distribution of responses for the house mortgage survey question seem to reflect an unwillingness by survey respondents to share these details with the pollster. Just over 1700 survey respondents declined to answer the question.
# house_mortgage
ggplot(cfpb_df, aes(x=house_mortgage)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "house_mortgage")ggplot(data = cfpb_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=house_mortgage, color=house_mortgage)) +
labs (title = "house_mortgage")ggplot(cfpb_df,
aes(cfpb_score, fill = house_mortgage, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by house_mortgage)")count(cfpb_df$house_mortgage)## x freq
## 1 Question not asked 1270
## 2 Refused 36
## 3 Less than $50,000 1040
## 4 $50,000-199,999 878
## 5 $200,000 or more 384
## 6 I don't know 68
## 7 Prefer not to say 324
Federal Reserve System
The data set from the Federal Reserve survey includes several questions that produce variable with overlapping measures. This includes the variables that measure education level, employment, house hold composition and economic circumstances. As part of the modeling selection process I will look to rationalize the list of variables so that I do not include several variables that measure the same underlying characteristic. Similar to the survey data from the CFPB a portion of the respondent declined to answer specific questions. I will filter out incomplete cases as required by the model being used.
Observations
- Few numeric variables with limited correlations. - similar to the CFPB survey data the Federal Reserve survey data reflects low correlation between the numeric variables in the data set. It should also be noted that the FED data set includes very few unique numeric variable. The overwhelming proportion of survey questions are designed around answer that can be placed in categories. The numeric numbers that exist primarily focus on house hold composition.
- survey data includes non complete records - the second phase of the analysis would be to identify smaller subset of the variable and review the impact of incomplete data sets and outliers.
# age_7cat + econ_saving + econ_inc_4cat + econ_fin_ok +
# econ_pay_exp400 + econ_skip_med
#fed_tmp_df <- fed_df %>% select(cfpb_score, econ_saving, econ_fin_ok, econ_pay_exp400, econ_skip_med, edu, credit_guess)
fed_tmp_df <- fed_df %>% select(cfpb_score, age_7cat, econ_saving, econ_inc_4cat, econ_fin_ok, econ_pay_exp400, econ_skip_med) #0.4675
summary(fed_tmp_df)## cfpb_score age_7cat econ_saving
## Min. :19.00 18-24:252 Under $50,000 :1175
## 1st Qu.:50.00 25-34:646 : 796
## Median :59.00 35-44:557 $100,000 - $249,999: 455
## Mean :60.05 45-54:528 $50,000 - $99,999 : 392
## 3rd Qu.:70.00 55-64:845 $1,000,000 or more : 339
## Max. :90.00 65-74:786 $250,000 - $499,999: 335
## 75+ :386 (Other) : 508
## econ_inc_4cat econ_fin_ok econ_pay_exp400 econ_skip_med
## $100,000 or more :1419 No : 669 No :1001 No :3272
## $25,000-$49,999 : 671 Refused: 4 Refused: 22 Refused: 20
## $50,000-$99,999 :1061 Yes :3327 Yes :2977 Yes : 708
## Less than $25,000: 810
## Refused : 39
##
##
The highest correlation in the numeric data is for children age 13 - 17 and over 18 years. These variables have a high correlations with cfpb score but they do not contributes substantially to the model .
ggscatmat(data=fed_df, corMethod = "spearman", alpha=0.2)## Warning in ggscatmat(data = fed_df, corMethod = "spearman", alpha = 0.2): Factor
## variables are omitted in plot
ggscatmat(data=fed_df, color="race", corMethod = "spearman", alpha=0.2)## Warning in ggscatmat(data = fed_df, color = "race", corMethod = "spearman", :
## Factor variables are omitted in plot
ggscatmat(data=fed_df, color="gender", corMethod = "spearman", alpha=0.2)## Warning in ggscatmat(data = fed_df, color = "gender", corMethod = "spearman", :
## Factor variables are omitted in plot
ggscatmat(data=fed_df, color="age_7cat", corMethod = "spearman", alpha=0.2)## Warning in ggscatmat(data = fed_df, color = "age_7cat", corMethod =
## "spearman", : Factor variables are omitted in plot
explore specific variables
The Economic circumstance variables (econ_fin_ok, econ_pay_exp400, econ_skip_med) are heavily represented in the federal reserve model. This is not surprising given the nature of the questions that define the cfpb score. It indicate the survey respondents are aware of their current financial wellness circumstance. As you can see from the grouped histogram there is distinct shift in the distribution upwards for survey respondents who feel they are on firmer footing financial. The alignment of the decision to forgo medical care based on on cfpb score highlights some real world impacts financial wellness. Finally it should be noted that there are cluster of outlier data at the high end and lower end of the distribution.
# econ_fin_ok
ggplot(fed_df, aes(x=econ_fin_ok)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "econ_fin_ok")ggplot(data = fed_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=econ_fin_ok, color=econ_fin_ok)) +
labs (title = "econ_fin_ok")ggplot(fed_df,
aes(cfpb_score, fill = econ_fin_ok, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by econ_fin_ok)")# econ_skip_med
ggplot(fed_df, aes(x=econ_skip_med)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "econ_skip_med")ggplot(fed_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=econ_skip_med, color=econ_skip_med)) +
labs (title = "econ_skip_med")ggplot(fed_df,
aes(cfpb_score, fill = econ_skip_med, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by econ_skip_med)")# econ_pay_exp400
ggplot(fed_df, aes(x=econ_pay_exp400)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "econ_pay_exp400")ggplot(data = fed_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=econ_pay_exp400, color=econ_pay_exp400)) +
labs (title = "econ_pay_exp400")ggplot(fed_df,
aes(cfpb_score, fill = econ_pay_exp400, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by econ_pay_exp400)")Model
This analysis will focus on a linear regression model as the foundation of the conclusions. However one of the goals for this project was to explore different data science models. To these ends I will touch on a random forest model and neural network model that will utilize the same data set to generate a predictive model.
The analysis includes the following models:
- linear regression model using parsnip packages from tidy models
- random forest model using parsnip packages from tidy models
- neural network model using the neuralnet package
Preconditions for Linear Models
The residual analysis from the CFPB model and Federal Reserve model roughly adheres to the following pre-conditions for linear modeling. More analysis would be required to ensure independence and a linear relationship to the dependent variable. It would be difficult to assert full independence of the predictor variables given that each case reflect an individual. Thre are numerouse studies that link demographics, race, geography and age to economic outcomes.
- residuals are nearly normal
- residuals have constant variability
- residuals are independent
- each variable is linearly related to the outcome
Model Definition and Variable Selection
The analysis started with a review of the survey code book to identify candidate variables for the analysis. Then relying on the findngs from each survey, and industry knowledge I conducted an initial variable selection. This selection process was used as a starting point, additional analysis could be conducted to refine the initial lists.
24 variables from the CFPB Survey.
- house hold structure
- housing economics and ownership characteristics
- income
- demographic
- geographic
- measures of financial knowledge and financial wellness
43 variable from the Fed Survey file were selected.
- house hold structure
- housing economics and ownership characteristics
- income
- demographic
- geographic
- measures of financial knowledge and financial wellness
- assessment of econmic situation
As an extension of the variable exploration stepwise regression was used to optimize the model variables based on their contribution to the model accuracy. I started using manual stepwise variable elimination process but opted for automated approach to simplify the analysis. The two approaches converged on models with similar accuracy and complexity but included different variables.
I used three automated model selection methods from the tidyverse packages for stepwise regression, Akaike information criterion(AIC) and Bayesian information criterion(BIC). The three methods of stepwise regression generated several candidates for the final model. I selected 2 models that although did not have the highest r-squared seemed to provide a good balance between accuracy and simplicity.
Consumer Financial Protection Bureau
The final model for the CFPB survey data included
- econ_save_rate
- house_mortgage
- age_8cat
- econ_hh_income
# select independent variables
cfpb_df <- cfpb_df %>% select(cfpb_score, econ_save_rate, house_mortgage, age_8cat, econ_hh_income)
# split data
cfpb_split <- initial_split(cfpb_df, prop = 0.8, strata = cfpb_score)
cfpb_training <- training(cfpb_split)
cfpb_test <- testing(cfpb_split)# Model Specification
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
# Fitting to Trained Data
lm_fit <- lm_model %>%
fit(cfpb_score ~ ., data = cfpb_training)
# Explore Training Results
names(lm_fit)## [1] "lvl" "spec" "fit" "preproc" "elapsed"
summary(lm_fit$fit)##
## Call:
## stats::lm(formula = cfpb_score ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.025 -6.526 -0.458 6.167 44.964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.6119 2.8403 17.467 < 2e-16 ***
## econ_save_rate0 -7.2219 2.8054 -2.574 0.010090 *
## econ_save_rate$1-99 -8.5856 2.8186 -3.046 0.002338 **
## econ_save_rate$100-999 -4.4037 2.7546 -1.599 0.109991
## econ_save_rate$1,000-4,999 -0.1939 2.7371 -0.071 0.943527
## econ_save_rate$5,000-19,999 4.4495 2.7306 1.629 0.103307
## econ_save_rate$20,000-74,999 7.9418 2.7419 2.897 0.003799 **
## econ_save_rate$75,000 or more 10.4604 2.7444 3.812 0.000141 ***
## econ_save_rateI don't know -0.3468 2.9226 -0.119 0.905541
## econ_save_ratePrefer not to say 4.9050 2.7363 1.793 0.073137 .
## house_mortgageRefused 5.4563 2.0307 2.687 0.007249 **
## house_mortgageLess than $50,000 3.8809 0.5941 6.532 7.53e-11 ***
## house_mortgage$50,000-199,999 1.0324 0.5794 1.782 0.074881 .
## house_mortgage$200,000 or more 1.0754 0.7584 1.418 0.156307
## house_mortgageI don't know -1.6690 1.4794 -1.128 0.259348
## house_mortgagePrefer not to say 1.5256 0.8160 1.870 0.061640 .
## age_8cat25-34 -0.6687 0.8851 -0.756 0.450002
## age_8cat35-44 -1.0433 0.9606 -1.086 0.277519
## age_8cat45-54 -2.2337 0.9320 -2.397 0.016602 *
## age_8cat55-61 -1.3542 0.9748 -1.389 0.164860
## age_8cat62-69 2.7485 0.9537 2.882 0.003980 **
## age_8cat70-74 4.4969 1.0607 4.240 2.30e-05 ***
## age_8cat75+ 5.2595 1.0037 5.240 1.71e-07 ***
## econ_hh_income$20,000 to $29,999 -1.2547 0.9357 -1.341 0.180068
## econ_hh_income$30,000 to $39,999 0.8115 0.8996 0.902 0.367089
## econ_hh_income$40,000 to $49,999 2.2390 0.9469 2.365 0.018111 *
## econ_hh_income$50,000 to $59,999 3.1783 0.9519 3.339 0.000850 ***
## econ_hh_income$60,000 to $74,999 3.3230 0.9138 3.636 0.000281 ***
## econ_hh_income$75,000 to $99,999 4.6582 0.8485 5.490 4.34e-08 ***
## econ_hh_income$100,000 to $149,999 6.0806 0.8465 7.184 8.43e-13 ***
## econ_hh_income$150,000 or more 7.5160 0.8938 8.409 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.37 on 3168 degrees of freedom
## Multiple R-squared: 0.4173, Adjusted R-squared: 0.4118
## F-statistic: 75.64 on 30 and 3168 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_fit$fit, pch = 16, col = '#006EA1')# Data frame of estimated coefficients
tidy(lm_fit)## # A tibble: 31 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 49.6 2.84 17.5 2.66e-65
## 2 econ_save_rate0 -7.22 2.81 -2.57 1.01e- 2
## 3 econ_save_rate$1-99 -8.59 2.82 -3.05 2.34e- 3
## 4 econ_save_rate$100-999 -4.40 2.75 -1.60 1.10e- 1
## 5 econ_save_rate$1,000-4,999 -0.194 2.74 -0.0708 9.44e- 1
## 6 econ_save_rate$5,000-19,999 4.45 2.73 1.63 1.03e- 1
## 7 econ_save_rate$20,000-74,999 7.94 2.74 2.90 3.80e- 3
## 8 econ_save_rate$75,000 or more 10.5 2.74 3.81 1.41e- 4
## 9 econ_save_rateI don't know -0.347 2.92 -0.119 9.06e- 1
## 10 econ_save_ratePrefer not to say 4.90 2.74 1.79 7.31e- 2
## # … with 21 more rows
# Performance metrics on training data
glance(lm_fit) ## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.417 0.412 10.4 75.6 0 30 -12006. 24075. 24270.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# variable importance
vip(lm_fit, num_features=15)#predict(lm_fit, new_data = cfpb_test)
cfpb_results <- predict(lm_fit, new_data = cfpb_test) %>%
bind_cols(cfpb_test)
#cfpb_results
# RMSE on test set
rmse(cfpb_results, truth = cfpb_score, estimate = .pred)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 10.6
rsq(cfpb_results, truth = cfpb_score, estimate = .pred)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.427
ggplot(data = cfpb_results,
mapping = aes(x = .pred, y = cfpb_score)) +
geom_point(color = '#006EA1') +
geom_smooth(method = lm) +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results CFPB data set)',
x = 'Predicted',
y = 'Actual')## `geom_smooth()` using formula 'y ~ x'
Random Forest
Additional analysis was conduct by creating a random forest machine learning model using parsnip from the tidy models package. The model was based on the varaiables selected in the previous linear regression model. The full model is available in the DATA607_ML.Rmd file. The resulting model is a categorical model with 4 primary categories for the cfpb score. Even with these broad categories the model reports an accuracy level of 0.505. The confusion matrix below depicts that model performance. The model had difficulties predicting cfpb scores values below 40 and above 80.
This was an interesting exploration however the linear regression model seems more usable for this particular problem set.
Neural Network
Further analysis was conducted by creating a neural network using the neuralnet package. The resulting neural network with associated weights is depicted below.
Neural Network
The graph below depicts the mapping of actual cfpb scores to the predicted cfpb scores generated by the neural network.
Neural Network
Federal Reserve
The final model for the CFPB survey data included - age_7cat - econ_saving - econ_inc_4cat - econ_fin_ok - econ_pay_exp400 - econ_skip_med
# select independent variables
#fed_df <- fed_df %>% select(cfpb_score, econ_saving, econ_fin_ok, econ_pay_exp400, econ_skip_med, edu, credit_guess) #0.4379
fed_df <- fed_df %>% select(cfpb_score, age, age_7cat, econ_saving, econ_inc_4cat, econ_fin_ok, econ_pay_exp400, econ_skip_med) #0.4675
# split data
fed_split <- initial_split(fed_df, prop = 0.8, strata = cfpb_score)
fed_training <- training(fed_split)
fed_test <- testing(fed_split)# Fitting to Trained Data
lm_fit <- lm_model %>%
fit(cfpb_score ~ ., data = fed_training)
# Explore Training Results
names(lm_fit)## [1] "lvl" "spec" "fit" "preproc" "elapsed"
summary(lm_fit$fit)##
## Call:
## stats::lm(formula = cfpb_score ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.135 -6.788 -0.738 6.466 32.090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.45928 1.78738 25.433 < 2e-16 ***
## age 0.15135 0.06205 2.439 0.01477 *
## age_7cat25-34 -3.21769 1.00635 -3.197 0.00140 **
## age_7cat35-44 -5.44628 1.43514 -3.795 0.00015 ***
## age_7cat45-54 -5.78081 1.98014 -2.919 0.00353 **
## age_7cat55-64 -6.39763 2.50266 -2.556 0.01062 *
## age_7cat65-74 -3.12727 3.07008 -1.019 0.30846
## age_7cat75+ -3.38188 3.71940 -0.909 0.36328
## econ_saving$1,000,000 or more 7.32404 0.81005 9.042 < 2e-16 ***
## econ_saving$100,000 - $249,999 1.20829 0.69505 1.738 0.08223 .
## econ_saving$250,000 - $499,999 1.69734 0.77751 2.183 0.02911 *
## econ_saving$50,000 - $99,999 0.04141 0.72572 0.057 0.95450
## econ_saving$500,000 - $999,999 3.39762 0.82441 4.121 3.86e-05 ***
## econ_savingNot sure 2.26184 1.18522 1.908 0.05643 .
## econ_savingRefused 1.99110 1.41009 1.412 0.15804
## econ_savingUnder $50,000 -1.64144 0.54485 -3.013 0.00261 **
## econ_inc_4cat$25,000-$49,999 -4.33619 0.62143 -6.978 3.64e-12 ***
## econ_inc_4cat$50,000-$99,999 -2.34490 0.50138 -4.677 3.03e-06 ***
## econ_inc_4catLess than $25,000 -3.31709 0.64475 -5.145 2.84e-07 ***
## econ_inc_4catRefused -3.25108 2.20784 -1.473 0.14098
## econ_fin_okRefused 15.95144 5.23388 3.048 0.00232 **
## econ_fin_okYes 10.94248 0.60587 18.061 < 2e-16 ***
## econ_pay_exp400Refused 4.54400 2.63361 1.725 0.08456 .
## econ_pay_exp400Yes 5.75951 0.52473 10.976 < 2e-16 ***
## econ_skip_medRefused -1.36292 2.60536 -0.523 0.60093
## econ_skip_medYes -6.58096 0.53583 -12.282 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 3173 degrees of freedom
## Multiple R-squared: 0.4849, Adjusted R-squared: 0.4808
## F-statistic: 119.5 on 25 and 3173 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_fit$fit, pch = 16, col = '#006EA1')# Data frame of estimated coefficients
tidy(lm_fit)## # A tibble: 26 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 45.5 1.79 25.4 5.01e-130
## 2 age 0.151 0.0620 2.44 1.48e- 2
## 3 age_7cat25-34 -3.22 1.01 -3.20 1.40e- 3
## 4 age_7cat35-44 -5.45 1.44 -3.79 1.50e- 4
## 5 age_7cat45-54 -5.78 1.98 -2.92 3.53e- 3
## 6 age_7cat55-64 -6.40 2.50 -2.56 1.06e- 2
## 7 age_7cat65-74 -3.13 3.07 -1.02 3.08e- 1
## 8 age_7cat75+ -3.38 3.72 -0.909 3.63e- 1
## 9 econ_saving$1,000,000 or more 7.32 0.810 9.04 2.63e- 19
## 10 econ_saving$100,000 - $249,999 1.21 0.695 1.74 8.22e- 2
## # … with 16 more rows
# Performance metrics on training data
glance(lm_fit) ## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.485 0.481 10.4 119. 0 25 -12011. 24076. 24240.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# variable importance
vip(lm_fit, num_features=15)#predict(lm_fit, new_data = fed_test)
fed_results <- predict(lm_fit, new_data = fed_test) %>%
bind_cols(fed_test)
fed_results ## # A tibble: 801 × 9
## .pred cfpb_score age age_7cat econ_saving econ_inc_4cat econ_fin_ok
## <dbl> <int> <int> <fct> <fct> <fct> <fct>
## 1 64.4 51 45 45-54 "$100,000 - $249… $100,000 or mo… Yes
## 2 45.7 41 43 35-44 "" $100,000 or mo… No
## 3 68.3 90 78 75+ "$50,000 - $99,9… $50,000-$99,999 Yes
## 4 60.7 82 27 25-34 "$50,000 - $99,9… $50,000-$99,999 Yes
## 5 42.4 40 55 55-64 "Under $50,000" Less than $25,… No
## 6 57.3 57 38 35-44 "Refused" $100,000 or mo… Yes
## 7 43.8 44 64 55-64 "Under $50,000" Less than $25,… No
## 8 77.4 70 73 65-74 "$1,000,000 or m… $100,000 or mo… Yes
## 9 77.6 90 74 65-74 "$1,000,000 or m… $100,000 or mo… Yes
## 10 59.5 64 64 55-64 "Under $50,000" $25,000-$49,999 Yes
## # … with 791 more rows, and 2 more variables: econ_pay_exp400 <fct>,
## # econ_skip_med <fct>
# RMSE on test set
rmse(fed_results, truth = cfpb_score, estimate = .pred)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 10.6
rsq(fed_results, truth = cfpb_score, estimate = .pred)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.469
ggplot(data = fed_results,
mapping = aes(x = .pred, y = cfpb_score)) +
geom_point(color = '#006EA1') +
geom_smooth(method = lm) +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression (Fed data set)',
x = 'Predicted',
y = 'Actual')## `geom_smooth()` using formula 'y ~ x'
Random Forest
Additional analysis was conduct by creating a random forest machine learning model using parsnip from the tidy models package. The model was based on the variables selected in the previous linear regression model. The full model is available in the DATA607_ML.Rmd file. The resulting model is a categorical model with 4 primary categories for the cfpb score. Even with these broad categories the model reports an accuracy level of 0.468 The confusion matrix below depicts that model performance.
Confusion Matrix
This was an interesting exploration however the linear regression model seems more usable for this particular problem set.
Neural Network
Further analysis was conducted by creating a neural network using the neuralnet package. The resulting neural network with associated weights is depicted below.
The graph below depicts the mapping of actual cfpb scores to the predicted cfpb scores generated by the neural network.
Neural Network
iNterpret
The testing of the model indicates a clear relationship between the model variables and the observed cfpb score for the surveyed individuals. The residual graph indicates outliers at the low end and the higher end of the cfpb score. Studying those segments of the population might provide some additional insight into the relevance of the cfpb score to the respective populations.
The cfpb score is based on 10 questions that measure your financial wellbeing. The survey questions are self reported which includes some variability and bias. Self reporting also results in incomplete records as not every individual surveyed answers all of the questions.
Unsurprisingly variables representing savings rate, age and income appear in both models. This is consistent with the general consensus in wealth management. Most financial advisers will highlight the importance of both income and spending to your financial wellbeing but more importantly the impact of time. As important as these variable are to financial wellness they are not able to explain more than 30% or 40% of your wellness score.
It is interesting that race, gender, education, house hold composition and education levels are not part of the model. These attributes . The interrelation between race, gender and education with income is not included in this analysis but could help to explain their absence from the model. These variables could be impacting financial wellness through their impact on income.
Challenges
There were several challenges associated with this assignment including:
- understanding the data - the survey data is well structured however each code book is 100s of pages describing the individual variables. Understanding the data, selecting the variables and filtering the dataset required a significant amount of analysis and additional reading of associated material from the CFPB and Federal Reserve.
- transforming and tiddying the data for each model type - each model type used in the analysis requires descriptive data in a specific format, either numeric or categorical. The random forest and neural network models required some additional analysis to define meaningful categories.
- model selection - efficiently filtering through 100s of survey variable to identify 4 or 5 variables that will construct the mdoel.
Conclusion
The linear regression model was more effective than the two random forest and neural network models. The random forest and neural network models struggled to differentiate cases at the high end and the low end of the cfpb score spectrum.
In reviewing the analysis what jumps out is the impact of early decisions on the overall outcome. The technical execution of the analysis is important but early decision on the data to include or exclude in the analysis has a large impact on the shape of the model. This analysis utilized several models with each model generating slightly different results.
I had not expected to resolve any long standing debates in the financial industry but it has been surprising how many different avenues for analysis or questions arose from this discussion.