Setup
# Exported data load, tidy function and tibble cleaning
fun_file_name <- glue(getwd(), "/Project/DATA606_Functions.R")
source(fun_file_name, local = knitr::knit_global())Abstract
The deregulation of the financial services industry has provided more investment opportunities for individuals and families but it has also increased the level of financial knowledge necessary to make good decisions. It is no coincidence that these trends have driven an increased focus in financial health and financial wellbeing by individuals, employers, government agencies and financial services firms. Understanding financial wellness requires a clear definition of wellness, an approach to measuring differences in wellbeing, and an understanding the factors that influence consumer financial wellness.
This project adopts the definition and measurement methodology developed by the Consumer Financial Protection Bureau as part of their ongoing research efforts. The CFPB defines financial wellness as the “…sense of financial security and freedom of choice—both in the present and when considering the future”. The CFPB has developed a measurement of consumer financial wellbeing called the Financial Wellbeing Scale (CFPB Score). The CFPB methodology includes a survey that not only captures the 10 questions necessary to calculate the CFPB score but includes additional attributes that paint a holistic financial and demographic picture of an individual or family.
This analysis will utilize the data from the CFPB Survey. Starting with the non calculated variables linear regression will be used to evaluate the relationship between non scoring attributes demographic, financial, house hold structure characteristics and the CFPB Score.
(Please note that the 10 questions that are used to calculate the CFPB Score are not included in the attributes evaluated in this analysis)
Part 1 - Introduction
This analysis is focused on understanding the factors that impact consumer financial wellbeing. There are numerous sources of analysis, theory and survey data however for the purposes of this project I will focus on the work from the Consumer Financial Protection Bureau National Financial Well-Being Survey Public Use File (PUF). The PUF survey data allows us to analyze a CFPB Score of consumer wellbeing in association with an individuals financial and demographic characteristics. Linear regression will be used to quantify the impact of individual factors and create a predictive model of financial wellbeing.
Analysis Steps In addition to the outline provided for the project that analysis will follow that steps below.
- Review the survey data including the cfpb score
- Select subset of 217 variables in the survey to analyze
- Explore the relationship between individual variables and the cfpb score
- Fit the full model using linear regression
- Optimize the model using stepwize regression
- Fit a models focused on financial planning variables and demographic variables respectively
- Explore impacts of survey variables on cfpb score
Part 2 - Data
The data was collected as part of the Consumer Financial Protection Bureau’s (CFPB) National Financial Well-Being Survey Public Use File (PUF). The PUF is a dataset containing
- data collected in the National Financial Well-Being Survey,
- data about members of the GfK KnowledgePanel collected prior to the survey, and
- data on poverty levels in respondents’ counties of residence.
The National Financial Well-Being Survey was conducted in English and Spanish via web mode between October 27, 2016 and December 5, 2016. Overall, 6,394 surveys were completed: 5,395 from the general population sample and 999 from an oversample of adults aged 62 and older. The survey was designed to represent the adult population of the 50 U.S. states and the District of Columbia. The survey was fielded on the GfK KnowledgePanel®. The KnowledgePanel sample is recruited using address-based sampling and dual-frame landline and cell phone random digit dialing methods.
The PUF was published in 2017.
Load data and review summary stats
Limited filtering of observations conducted as part of this analysis Other answers such as “refuse to answer” or “do not know” actual contain some relevant information in the context of a individuals financial characteristics. This can be seen in the prevalence of these answers in the dataset roughly 2/3 or the observations include at least one survey answer in the other category.
# full dataset
cfpbRaw_df <- getRawCFPBFile()
# load data - logic for data tidying is associated DATA606_Functions.R file
cfpb_df <- getReducedCFPBFile()
cfpb_df <- cfpb_df %>% droplevels()
# Filter records with invalid cfpb scores
cfpb_df <- cfpb_df %>% filter( cfpb_score > 0)summary(cfpb_df)## cfpb_score econ_hh_income econ_inc_volatility
## Min. :14.00 $100,000 to $149,999:1114 Refused : 61
## 1st Qu.:48.00 $75,000 to $99,999 : 954 Roughly the same :4611
## Median :56.00 $150,000 or more : 861 some unusually spending:1303
## Mean :56.08 Less than $20,000 : 717 varies quite a bit : 414
## 3rd Qu.:65.00 $60,000 to $74,999 : 651
## Max. :95.00 $30,000 to $39,999 : 614
## (Other) :1478
## econ_hh_earners econ_emp_status econ_savings
## Refused : 75 full-time :2491 $5,000-19,999 :1093
## One :2440 Retired :1837 Prefer not to say: 987
## Two :3532 part-time : 432 $1,000-4,999 : 931
## More than two: 342 Self-employed : 418 $20,000-74,999 : 845
## Homemaker : 358 $75,000 or more : 834
## unable to work: 269 $100-999 : 682
## (Other) : 584 (Other) :1017
## hh_status hh_value
## Refused : 39 Question not asked:2226
## I own my home :4163 $150,000-249,999 :1053
## I rent :1642 Less than $150,000:1034
## I do not currently own or rent: 545 $250,000-399,999 : 991
## $400,000 or more : 804
## Prefer not to say : 159
## (Other) : 122
## hh_mortgage per_health
## Question not asked:2226 Refused : 52
## Refused : 56 Poor : 154
## Less than $50,000 :1512 Fair : 780
## $50,000-199,999 :1360 Good :2137
## $200,000 or more : 622 Very good:2582
## I don't know : 115 Excellent: 684
## Prefer not to say : 498
## hh_arrange hh_size
## Refused : 28 1 :1227
## only adult in the household :1128 2 :2702
## spouse/partner/significant other :4204 3 : 996
## my parents' home : 450 4 : 817
## other family, friends, or roommates: 453 5+: 647
## Some other arrangement : 126
##
## dem_marital dem_age dem_generation
## Married :3825 25-34 :1113 Pre-Boomer:1112
## Widowed : 360 45-54 :1074 Boomer :2252
## Divorced/Separated : 687 62-69 :1021 Gen X :1429
## Never married :1149 35-44 : 828 Millennial:1596
## Living with partner: 368 75+ : 736
## 55-61 : 707
## (Other): 910
## dem_edu dem_race dem_gender
## Less than high school : 429 White, Non-Hispanic:4495 Male :3350
## High school degree/GED :1620 Black, Non-Hispanic: 684 Female:3039
## Some college/Associate :1931 Other, Non-Hispanic: 336
## Bachelor's degree :1312 Hispanic : 874
## Graduate/professional degree:1097
##
##
## fin_achieve_goals fin_goals fin_saving_habit
## Refused : 25 Refused: 74 Refused : 16
## Not at all confident: 128 No :2268 Strongly disagree: 297
## Not very confident : 735 Yes :4047 Disagree : 610
## Somewhat confident :3034 Disagree slightly: 659
## Very confident :2467 Agree slightly :1329
## Agree :1728
## Strongly agree :1750
## fin_frugal fin_consult_budget
## Refused : 12 Refused : 6
## Strongly disagree: 22 Strongly disagree : 200
## Disagree : 50 Disagree : 727
## Disagree slightly: 195 Neither agree nor disagree:1330
## Agree slightly : 805 Agree :2776
## Agree :2459 Strongly agree :1350
## Strongly agree :2846
## fin_follow_budget fin_set_goals
## Refused : 6 Refused : 6
## Strongly disagree : 132 Strongly disagree : 127
## Disagree : 653 Disagree : 528
## Neither agree nor disagree:1629 Neither agree nor disagree:1625
## Agree :2985 Agree :3088
## Strongly agree : 984 Strongly agree :1015
##
## fin_clear_plan
## Refused : 6
## Strongly disagree : 265
## Disagree :1197
## Neither agree nor disagree:2208
## Agree :2034
## Strongly agree : 679
##
The cfpb score for this survey data is near normaly distributed with outlines at the high end and low end of the scoring distribution.
# cfpb
ggplot(cfpb_df,
aes(cfpb_score, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", color="red", size = 0.5) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (cfpb dataset)")summary(cfpb_df$cfpb_score)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 48.00 56.00 56.08 65.00 95.00
House Hold Income
As would be expected the histogram and the box plot for cfpb score categorized by house hold income shows some positive correlation between house hold income and the cfpb_score.
# econ_hh_income
ggplot(cfpb_df, aes(x=econ_hh_income)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "econ_hh_income")ggplot(data = cfpb_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=econ_hh_income, color=econ_hh_income)) +
labs (title="CFPB Score ( by econ_hh_income)", x="CFPB score", y="house hold income")ggplot(cfpb_df,
aes(cfpb_score, fill = econ_hh_income, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by econ_hh_income)")
Age
As would be expected the histogram and the box plot for cfpb score categorized by age shows some positive correlation between house hold income and the cfpb_score. This finding aligns with the typical relationship between networth, income and age.
# dem_age
ggplot(cfpb_df, aes(x=dem_age)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "dem_age")ggplot(data = cfpb_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=dem_age, color=dem_age)) +
labs (title = "CFPB Score (by dem_age)", x="CFPB score", y="age category")ggplot(cfpb_df,
aes(cfpb_score, fill = dem_age, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by dem_age)")
Achieve Financial Goals
The relationship between sfpb_score and confidence in reaching your financial goals is intriguing It highlights a positive relationship between confidence in financial goals and cfpb score. The box plot and histogram show a wide range of survey respondents are optimistic even if their cfpb scores would indicate otherwise.
# fin_achieve_goals
ggplot(cfpb_df, aes(x=fin_achieve_goals)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "fin_achieve_goals")ggplot(data = cfpb_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=fin_achieve_goals, color=fin_achieve_goals)) +
labs (title = "CFPB Score (by fin_achieve_goals)", x="CFPB score", y="belief in achieving fin goals")ggplot(cfpb_df,
aes(cfpb_score, fill = fin_achieve_goals, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by fin_achieve_goals)")
Health
Given the structure of the health care system in the United States it is unsurprising that health shows a positive correlated cfpb score.
# per_health
ggplot(cfpb_df, aes(x=fin_frugal)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "per_health")ggplot(data = cfpb_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=per_health, color=per_health)) +
labs (title = "CFPB Score (by per_health)", x="CFPB score", y="personal health evaluation")ggplot(cfpb_df,
aes(cfpb_score, fill = per_health, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by per_health)")Part 3 - Exploratory data analysis
This analysis uses linear regression to identify and quantify the impacts of survey variables to the cfpb score.
Preconditions for Linear Models
The residual analysis from the CFPB model and Federal Reserve model roughly adheres to the following pre-conditions for linear modeling. More analysis would be required to ensure independence and a linear relationship to the dependent variable. It would be difficult to assert full independence of the predictor variables given that each case reflect an individual. Thre are numerous studies that link demographics, race, geography and age to economic outcomes.
- residuals are nearly normal
- residuals have constant variability
- residuals are independent
- each variable is linearly related to the outcome
Fit Full Model
The full model of all 26 variables included in the analysis is used as a starting point. The resulting model has an adjusted \(R^2\) of 0.5739.
# split data
cfpb_split <- initial_split(cfpb_df, prop = 0.8, strata = cfpb_score)
cfpb_training <- training(cfpb_split)
cfpb_test <- testing(cfpb_split)
# fit full model
cfpb_result1<-lm(cfpb_score ~ . ,data=cfpb_training)
summary(cfpb_result1)##
## Call:
## lm(formula = cfpb_score ~ ., data = cfpb_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.071 -5.526 -0.130 5.542 35.792
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 5.430e+01 4.508e+00 12.046
## econ_hh_income$20,000 to $29,999 2.647e-01 6.267e-01 0.422
## econ_hh_income$30,000 to $39,999 6.645e-01 6.056e-01 1.097
## econ_hh_income$40,000 to $49,999 2.232e+00 6.621e-01 3.372
## econ_hh_income$50,000 to $59,999 2.689e+00 6.640e-01 4.050
## econ_hh_income$60,000 to $74,999 2.700e+00 6.336e-01 4.261
## econ_hh_income$75,000 to $99,999 3.109e+00 6.127e-01 5.075
## econ_hh_income$100,000 to $149,999 3.848e+00 6.282e-01 6.127
## econ_hh_income$150,000 or more 5.365e+00 6.901e-01 7.774
## econ_inc_volatilityRoughly the same -2.284e-01 1.665e+00 -0.137
## econ_inc_volatilitysome unusually spending -1.276e+00 1.680e+00 -0.759
## econ_inc_volatilityvaries quite a bit -1.945e+00 1.737e+00 -1.120
## econ_hh_earnersOne 5.485e-01 1.503e+00 0.365
## econ_hh_earnersTwo 5.072e-01 1.509e+00 0.336
## econ_hh_earnersMore than two 1.597e+00 1.608e+00 0.994
## econ_emp_statusfull-time -3.769e-02 5.675e-01 -0.066
## econ_emp_statuspart-time -3.900e-01 7.182e-01 -0.543
## econ_emp_statusHomemaker 8.975e-01 7.894e-01 1.137
## econ_emp_statusFull-time student 6.551e-02 9.580e-01 0.068
## econ_emp_statusunable to work -8.031e-01 8.810e-01 -0.912
## econ_emp_statusUnemployed -1.983e+00 8.614e-01 -2.302
## econ_emp_statusRetired 1.462e+00 6.576e-01 2.223
## econ_emp_statusRefused -5.139e-01 1.159e+00 -0.443
## econ_savings0 -2.672e+00 2.006e+00 -1.332
## econ_savings$1-99 -3.720e+00 2.007e+00 -1.853
## econ_savings$100-999 -1.831e+00 1.966e+00 -0.931
## econ_savings$1,000-4,999 1.334e+00 1.951e+00 0.684
## econ_savings$5,000-19,999 3.869e+00 1.946e+00 1.988
## econ_savings$20,000-74,999 5.779e+00 1.952e+00 2.960
## econ_savings$75,000 or more 7.855e+00 1.961e+00 4.006
## econ_savingsI don't know 1.483e+00 2.082e+00 0.712
## econ_savingsPrefer not to say 3.630e+00 1.940e+00 1.871
## hh_statusI own my home -2.787e-01 2.714e+00 -0.103
## hh_statusI rent 1.863e-02 2.585e+00 0.007
## hh_statusI do not currently own or rent -2.610e-01 2.620e+00 -0.100
## hh_valueRefused 3.146e+00 3.880e+00 0.811
## hh_valueLess than $150,000 1.238e+00 9.765e-01 1.268
## hh_value$150,000-249,999 1.349e+00 9.716e-01 1.388
## hh_value$250,000-399,999 2.316e+00 9.804e-01 2.362
## hh_value$400,000 or more 2.291e+00 1.015e+00 2.258
## hh_valueI don't know 3.349e-01 1.304e+00 0.257
## hh_valuePrefer not to say NA NA NA
## hh_mortgageRefused 2.391e+00 1.454e+00 1.644
## hh_mortgageLess than $50,000 3.277e-01 6.021e-01 0.544
## hh_mortgage$50,000-199,999 -1.046e+00 6.322e-01 -1.655
## hh_mortgage$200,000 or more -2.131e+00 7.451e-01 -2.859
## hh_mortgageI don't know -1.158e+00 1.114e+00 -1.039
## hh_mortgagePrefer not to say NA NA NA
## per_healthPoor -4.545e+00 1.829e+00 -2.485
## per_healthFair -4.145e+00 1.638e+00 -2.531
## per_healthGood -2.170e+00 1.607e+00 -1.350
## per_healthVery good -4.307e-01 1.603e+00 -0.269
## per_healthExcellent 5.609e-01 1.638e+00 0.342
## hh_arrangeonly adult in the household -6.562e-01 3.240e+00 -0.203
## hh_arrangespouse/partner/significant other 2.313e-01 3.218e+00 0.072
## hh_arrangemy parents' home -2.678e-01 3.275e+00 -0.082
## hh_arrangeother family, friends, or roommates -8.478e-01 3.261e+00 -0.260
## hh_arrangeSome other arrangement 1.794e-01 3.330e+00 0.054
## hh_size2 -4.275e-01 4.668e-01 -0.916
## hh_size3 -9.878e-01 5.403e-01 -1.828
## hh_size4 -9.546e-01 5.823e-01 -1.639
## hh_size5+ -8.636e-01 6.215e-01 -1.389
## dem_maritalWidowed -3.184e-01 7.782e-01 -0.409
## dem_maritalDivorced/Separated -5.844e-01 5.904e-01 -0.990
## dem_maritalNever married 6.354e-01 6.343e-01 1.002
## dem_maritalLiving with partner -1.239e-01 6.142e-01 -0.202
## dem_age25-34 2.128e-01 6.942e-01 0.306
## dem_age35-44 -2.319e-04 1.398e+00 0.000
## dem_age45-54 -7.202e-01 1.508e+00 -0.478
## dem_age55-61 -4.314e-01 1.646e+00 -0.262
## dem_age62-69 2.767e+00 1.661e+00 1.666
## dem_age70-74 4.343e+00 1.913e+00 2.270
## dem_age75+ 5.188e+00 2.029e+00 2.557
## dem_generationBoomer 1.109e+00 1.077e+00 1.029
## dem_generationGen X -3.163e-01 1.260e+00 -0.251
## dem_generationMillennial -8.569e-01 1.794e+00 -0.478
## dem_eduHigh school degree/GED -6.899e-01 5.844e-01 -1.180
## dem_eduSome college/Associate -1.285e+00 5.968e-01 -2.153
## dem_eduBachelor's degree -1.553e+00 6.520e-01 -2.381
## dem_eduGraduate/professional degree -1.292e+00 6.778e-01 -1.906
## dem_raceBlack, Non-Hispanic 9.149e-01 4.522e-01 2.023
## dem_raceOther, Non-Hispanic -9.987e-01 5.780e-01 -1.728
## dem_raceHispanic 1.004e+00 4.135e-01 2.429
## dem_genderFemale -6.294e-02 2.771e-01 -0.227
## fin_achieve_goalsNot at all confident -1.194e+01 2.411e+00 -4.952
## fin_achieve_goalsNot very confident -6.775e+00 2.241e+00 -3.023
## fin_achieve_goalsSomewhat confident -1.739e+00 2.221e+00 -0.783
## fin_achieve_goalsVery confident 4.316e+00 2.233e+00 1.933
## fin_goalsNo 6.817e-01 1.286e+00 0.530
## fin_goalsYes -3.886e-01 1.283e+00 -0.303
## fin_saving_habitStrongly disagree 3.275e+00 4.142e+00 0.791
## fin_saving_habitDisagree 5.988e+00 4.110e+00 1.457
## fin_saving_habitDisagree slightly 7.041e+00 4.104e+00 1.715
## fin_saving_habitAgree slightly 8.211e+00 4.093e+00 2.006
## fin_saving_habitAgree 9.440e+00 4.094e+00 2.306
## fin_saving_habitStrongly agree 1.203e+01 4.093e+00 2.938
## fin_frugalStrongly disagree -1.544e+01 5.340e+00 -2.890
## fin_frugalDisagree -1.400e+01 5.079e+00 -2.758
## fin_frugalDisagree slightly -1.429e+01 4.907e+00 -2.912
## fin_frugalAgree slightly -1.489e+01 4.871e+00 -3.056
## fin_frugalAgree -1.553e+01 4.867e+00 -3.190
## fin_frugalStrongly agree -1.660e+01 4.863e+00 -3.415
## fin_consult_budgetStrongly disagree 1.086e+01 6.233e+00 1.742
## fin_consult_budgetDisagree 8.170e+00 6.168e+00 1.324
## fin_consult_budgetNeither agree nor disagree 6.686e+00 6.153e+00 1.087
## fin_consult_budgetAgree 6.085e+00 6.158e+00 0.988
## fin_consult_budgetStrongly agree 5.439e+00 6.141e+00 0.886
## fin_follow_budgetStrongly disagree -4.406e-01 1.437e+00 -0.307
## fin_follow_budgetDisagree 7.144e-01 7.423e-01 0.962
## fin_follow_budgetNeither agree nor disagree 6.417e-01 6.077e-01 1.056
## fin_follow_budgetAgree -1.450e-02 5.330e-01 -0.027
## fin_follow_budgetStrongly agree NA NA NA
## fin_set_goalsStrongly disagree -3.289e+00 1.382e+00 -2.380
## fin_set_goalsDisagree -1.921e+00 7.674e-01 -2.503
## fin_set_goalsNeither agree nor disagree -1.767e+00 6.022e-01 -2.934
## fin_set_goalsAgree -2.159e+00 5.196e-01 -4.156
## fin_set_goalsStrongly agree NA NA NA
## fin_clear_planStrongly disagree -1.262e+00 1.046e+00 -1.207
## fin_clear_planDisagree -4.698e-01 7.455e-01 -0.630
## fin_clear_planNeither agree nor disagree -8.492e-01 6.827e-01 -1.244
## fin_clear_planAgree -6.931e-01 6.379e-01 -1.086
## fin_clear_planStrongly agree NA NA NA
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## econ_hh_income$20,000 to $29,999 0.672719
## econ_hh_income$30,000 to $39,999 0.272631
## econ_hh_income$40,000 to $49,999 0.000752 ***
## econ_hh_income$50,000 to $59,999 5.20e-05 ***
## econ_hh_income$60,000 to $74,999 2.07e-05 ***
## econ_hh_income$75,000 to $99,999 4.02e-07 ***
## econ_hh_income$100,000 to $149,999 9.67e-10 ***
## econ_hh_income$150,000 or more 9.19e-15 ***
## econ_inc_volatilityRoughly the same 0.890918
## econ_inc_volatilitysome unusually spending 0.447627
## econ_inc_volatilityvaries quite a bit 0.262777
## econ_hh_earnersOne 0.715171
## econ_hh_earnersTwo 0.736794
## econ_hh_earnersMore than two 0.320490
## econ_emp_statusfull-time 0.947052
## econ_emp_statuspart-time 0.587158
## econ_emp_statusHomemaker 0.255619
## econ_emp_statusFull-time student 0.945485
## econ_emp_statusunable to work 0.362001
## econ_emp_statusUnemployed 0.021395 *
## econ_emp_statusRetired 0.026229 *
## econ_emp_statusRefused 0.657612
## econ_savings0 0.182853
## econ_savings$1-99 0.063913 .
## econ_savings$100-999 0.351856
## econ_savings$1,000-4,999 0.494203
## econ_savings$5,000-19,999 0.046920 *
## econ_savings$20,000-74,999 0.003088 **
## econ_savings$75,000 or more 6.26e-05 ***
## econ_savingsI don't know 0.476241
## econ_savingsPrefer not to say 0.061355 .
## hh_statusI own my home 0.918191
## hh_statusI rent 0.994249
## hh_statusI do not currently own or rent 0.920638
## hh_valueRefused 0.417472
## hh_valueLess than $150,000 0.204933
## hh_value$150,000-249,999 0.165212
## hh_value$250,000-399,999 0.018213 *
## hh_value$400,000 or more 0.023974 *
## hh_valueI don't know 0.797270
## hh_valuePrefer not to say NA
## hh_mortgageRefused 0.100214
## hh_mortgageLess than $50,000 0.586221
## hh_mortgage$50,000-199,999 0.098026 .
## hh_mortgage$200,000 or more 0.004262 **
## hh_mortgageI don't know 0.298779
## hh_mortgagePrefer not to say NA
## per_healthPoor 0.013001 *
## per_healthFair 0.011410 *
## per_healthGood 0.176926
## per_healthVery good 0.788116
## per_healthExcellent 0.731991
## hh_arrangeonly adult in the household 0.839491
## hh_arrangespouse/partner/significant other 0.942704
## hh_arrangemy parents' home 0.934829
## hh_arrangeother family, friends, or roommates 0.794903
## hh_arrangeSome other arrangement 0.957031
## hh_size2 0.359733
## hh_size3 0.067547 .
## hh_size4 0.101241
## hh_size5+ 0.164760
## dem_maritalWidowed 0.682487
## dem_maritalDivorced/Separated 0.322359
## dem_maritalNever married 0.316521
## dem_maritalLiving with partner 0.840099
## dem_age25-34 0.759238
## dem_age35-44 0.999868
## dem_age45-54 0.632898
## dem_age55-61 0.793279
## dem_age62-69 0.095749 .
## dem_age70-74 0.023240 *
## dem_age75+ 0.010580 *
## dem_generationBoomer 0.303324
## dem_generationGen X 0.801834
## dem_generationMillennial 0.633020
## dem_eduHigh school degree/GED 0.237891
## dem_eduSome college/Associate 0.031381 *
## dem_eduBachelor's degree 0.017288 *
## dem_eduGraduate/professional degree 0.056695 .
## dem_raceBlack, Non-Hispanic 0.043109 *
## dem_raceOther, Non-Hispanic 0.084098 .
## dem_raceHispanic 0.015164 *
## dem_genderFemale 0.820338
## fin_achieve_goalsNot at all confident 7.60e-07 ***
## fin_achieve_goalsNot very confident 0.002515 **
## fin_achieve_goalsSomewhat confident 0.433602
## fin_achieve_goalsVery confident 0.053264 .
## fin_goalsNo 0.595959
## fin_goalsYes 0.761941
## fin_saving_habitStrongly disagree 0.429257
## fin_saving_habitDisagree 0.145158
## fin_saving_habitDisagree slightly 0.086334 .
## fin_saving_habitAgree slightly 0.044880 *
## fin_saving_habitAgree 0.021163 *
## fin_saving_habitStrongly agree 0.003316 **
## fin_frugalStrongly disagree 0.003864 **
## fin_frugalDisagree 0.005844 **
## fin_frugalDisagree slightly 0.003605 **
## fin_frugalAgree slightly 0.002254 **
## fin_frugalAgree 0.001430 **
## fin_frugalStrongly agree 0.000644 ***
## fin_consult_budgetStrongly disagree 0.081513 .
## fin_consult_budgetDisagree 0.185408
## fin_consult_budgetNeither agree nor disagree 0.277250
## fin_consult_budgetAgree 0.323124
## fin_consult_budgetStrongly agree 0.375875
## fin_follow_budgetStrongly disagree 0.759126
## fin_follow_budgetDisagree 0.335858
## fin_follow_budgetNeither agree nor disagree 0.291068
## fin_follow_budgetAgree 0.978294
## fin_follow_budgetStrongly agree NA
## fin_set_goalsStrongly disagree 0.017368 *
## fin_set_goalsDisagree 0.012343 *
## fin_set_goalsNeither agree nor disagree 0.003361 **
## fin_set_goalsAgree 3.30e-05 ***
## fin_set_goalsStrongly agree NA
## fin_clear_planStrongly disagree 0.227406
## fin_clear_planDisagree 0.528636
## fin_clear_planNeither agree nor disagree 0.213578
## fin_clear_planAgree 0.277313
## fin_clear_planStrongly agree NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.111 on 4993 degrees of freedom
## Multiple R-squared: 0.5836, Adjusted R-squared: 0.5739
## F-statistic: 60.33 on 116 and 4993 DF, p-value: < 2.2e-16
Stepwise model optimization
Starting with the full model that incldued all 26 variables setpwize regression was used to optimize the model.
Part 4 - Inference
Optimized Model
Model stats - The adjuste \(R^2\) for the model is 0.5633 - The residual analysis analysis shows several outliers that are impacting the model
# Step: AIC=23075.37
# cfpb_score ~ econ_hh_income + econ_savings + per_health + dem_age +
# fin_achieve_goals + fin_saving_habit + fin_consult_budget +
# fin_set_goals
# Model Specification
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
# Fitting to Trained Data
lm_fit <- lm_model %>%
fit(cfpb_score ~ econ_hh_income + econ_savings + per_health + dem_age +
fin_achieve_goals + fin_saving_habit + fin_consult_budget +
fin_set_goals
, data = cfpb_training)
# Explore Training Results
names(lm_fit)## [1] "lvl" "spec" "fit" "preproc" "elapsed"
summary(lm_fit$fit)##
## Call:
## stats::lm(formula = cfpb_score ~ econ_hh_income + econ_savings +
## per_health + dem_age + fin_achieve_goals + fin_saving_habit +
## fin_consult_budget + fin_set_goals, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.820 -5.814 -0.140 5.623 35.024
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 52.41177 3.83537 13.665
## econ_hh_income$20,000 to $29,999 -0.10330 0.62024 -0.167
## econ_hh_income$30,000 to $39,999 0.35286 0.59375 0.594
## econ_hh_income$40,000 to $49,999 1.88703 0.64083 2.945
## econ_hh_income$50,000 to $59,999 2.36279 0.63852 3.700
## econ_hh_income$60,000 to $74,999 2.22432 0.59614 3.731
## econ_hh_income$75,000 to $99,999 2.68264 0.55822 4.806
## econ_hh_income$100,000 to $149,999 3.30893 0.55404 5.972
## econ_hh_income$150,000 or more 4.76658 0.59455 8.017
## econ_savings0 -3.45290 1.91836 -1.800
## econ_savings$1-99 -4.75857 1.91987 -2.479
## econ_savings$100-999 -2.77350 1.87634 -1.478
## econ_savings$1,000-4,999 0.39751 1.86456 0.213
## econ_savings$5,000-19,999 3.06818 1.86094 1.649
## econ_savings$20,000-74,999 5.36737 1.86967 2.871
## econ_savings$75,000 or more 7.81470 1.87755 4.162
## econ_savingsI don't know 0.95326 2.00721 0.475
## econ_savingsPrefer not to say 3.07322 1.85501 1.657
## per_healthPoor -5.07145 1.76005 -2.881
## per_healthFair -4.60820 1.56562 -2.943
## per_healthGood -2.44365 1.53806 -1.589
## per_healthVery good -0.82525 1.53449 -0.538
## per_healthExcellent 0.01169 1.56971 0.007
## dem_age25-34 0.11157 0.59899 0.186
## dem_age35-44 0.55183 0.63008 0.876
## dem_age45-54 0.48912 0.60863 0.804
## dem_age55-61 2.18595 0.65510 3.337
## dem_age62-69 6.34868 0.62475 10.162
## dem_age70-74 7.61390 0.70862 10.745
## dem_age75+ 8.46124 0.66936 12.641
## fin_achieve_goalsNot at all confident -12.03539 2.38553 -5.045
## fin_achieve_goalsNot very confident -6.70458 2.20614 -3.039
## fin_achieve_goalsSomewhat confident -1.60568 2.18191 -0.736
## fin_achieve_goalsVery confident 4.65045 2.19378 2.120
## fin_saving_habitStrongly disagree -4.62625 3.62588 -1.276
## fin_saving_habitDisagree -1.71659 3.58829 -0.478
## fin_saving_habitDisagree slightly -0.45937 3.58453 -0.128
## fin_saving_habitAgree slightly 0.66615 3.57125 0.187
## fin_saving_habitAgree 1.74502 3.57127 0.489
## fin_saving_habitStrongly agree 3.97200 3.57159 1.112
## fin_consult_budgetStrongly disagree 3.08194 5.49588 0.561
## fin_consult_budgetDisagree 0.83319 5.44058 0.153
## fin_consult_budgetNeither agree nor disagree -0.68170 5.42771 -0.126
## fin_consult_budgetAgree -1.64487 5.43326 -0.303
## fin_consult_budgetStrongly agree -2.41061 5.42351 -0.444
## fin_set_goalsStrongly disagree -3.56790 1.19038 -2.997
## fin_set_goalsDisagree -1.61324 0.66416 -2.429
## fin_set_goalsNeither agree nor disagree -1.57624 0.51887 -3.038
## fin_set_goalsAgree -2.49662 0.44488 -5.612
## fin_set_goalsStrongly agree NA NA NA
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## econ_hh_income$20,000 to $29,999 0.867726
## econ_hh_income$30,000 to $39,999 0.552346
## econ_hh_income$40,000 to $49,999 0.003248 **
## econ_hh_income$50,000 to $59,999 0.000218 ***
## econ_hh_income$60,000 to $74,999 0.000193 ***
## econ_hh_income$75,000 to $99,999 1.59e-06 ***
## econ_hh_income$100,000 to $149,999 2.50e-09 ***
## econ_hh_income$150,000 or more 1.33e-15 ***
## econ_savings0 0.071932 .
## econ_savings$1-99 0.013223 *
## econ_savings$100-999 0.139432
## econ_savings$1,000-4,999 0.831185
## econ_savings$5,000-19,999 0.099265 .
## econ_savings$20,000-74,999 0.004112 **
## econ_savings$75,000 or more 3.20e-05 ***
## econ_savingsI don't know 0.634867
## econ_savingsPrefer not to say 0.097640 .
## per_healthPoor 0.003975 **
## per_healthFair 0.003261 **
## per_healthGood 0.112172
## per_healthVery good 0.590738
## per_healthExcellent 0.994056
## dem_age25-34 0.852240
## dem_age35-44 0.381176
## dem_age45-54 0.421643
## dem_age55-61 0.000854 ***
## dem_age62-69 < 2e-16 ***
## dem_age70-74 < 2e-16 ***
## dem_age75+ < 2e-16 ***
## fin_achieve_goalsNot at all confident 4.69e-07 ***
## fin_achieve_goalsNot very confident 0.002385 **
## fin_achieve_goalsSomewhat confident 0.461820
## fin_achieve_goalsVery confident 0.034068 *
## fin_saving_habitStrongly disagree 0.202051
## fin_saving_habitDisagree 0.632397
## fin_saving_habitDisagree slightly 0.898031
## fin_saving_habitAgree slightly 0.852035
## fin_saving_habitAgree 0.625127
## fin_saving_habitStrongly agree 0.266143
## fin_consult_budgetStrongly disagree 0.574977
## fin_consult_budgetDisagree 0.878291
## fin_consult_budgetNeither agree nor disagree 0.900057
## fin_consult_budgetAgree 0.762100
## fin_consult_budgetStrongly agree 0.656719
## fin_set_goalsStrongly disagree 0.002737 **
## fin_set_goalsDisagree 0.015176 *
## fin_set_goalsNeither agree nor disagree 0.002395 **
## fin_set_goalsAgree 2.11e-08 ***
## fin_set_goalsStrongly agree NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.223 on 5061 degrees of freedom
## Multiple R-squared: 0.5675, Adjusted R-squared: 0.5633
## F-statistic: 138.3 on 48 and 5061 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_fit$fit, pch = 16, col = '#006EA1')# Data frame of estimated coefficients
tidy(lm_fit)## # A tibble: 50 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 52.4 3.84 13.7 8.94e-42
## 2 econ_hh_income$20,000 to $29,999 -0.103 0.620 -0.167 8.68e- 1
## 3 econ_hh_income$30,000 to $39,999 0.353 0.594 0.594 5.52e- 1
## 4 econ_hh_income$40,000 to $49,999 1.89 0.641 2.94 3.25e- 3
## 5 econ_hh_income$50,000 to $59,999 2.36 0.639 3.70 2.18e- 4
## 6 econ_hh_income$60,000 to $74,999 2.22 0.596 3.73 1.93e- 4
## 7 econ_hh_income$75,000 to $99,999 2.68 0.558 4.81 1.59e- 6
## 8 econ_hh_income$100,000 to $149,999 3.31 0.554 5.97 2.50e- 9
## 9 econ_hh_income$150,000 or more 4.77 0.595 8.02 1.33e-15
## 10 econ_savings0 -3.45 1.92 -1.80 7.19e- 2
## # … with 40 more rows
# Performance metrics on training data
glance(lm_fit) ## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.567 0.563 9.22 138. 0 48 -18579. 37259. 37585.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# variable importance
vip(lm_fit, num_features=15)
Testing Optimized Model
#predict(lm_fit, new_data = cfpb_test)
cfpb_results <- predict(lm_fit, new_data = cfpb_test) %>%
bind_cols(cfpb_test)## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response"): prediction from a rank-deficient fit may be misleading
#cfpb_results
# RMSE on test set
rmse(cfpb_results, truth = cfpb_score, estimate = .pred)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 9.61
rsq(cfpb_results, truth = cfpb_score, estimate = .pred)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.561
ggplot(data = cfpb_results,
mapping = aes(x = .pred, y = cfpb_score)) +
geom_point(color = '#006EA1') +
geom_smooth(method = lm) +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - Optimized',
x = 'Predicted',
y = 'Actual')## `geom_smooth()` using formula 'y ~ x'
# econ_hh_income
ggplot(data = cfpb_results) +
geom_point(mapping = aes(x=cfpb_score, y=econ_hh_income, color=econ_hh_income)) +
labs (title = "econ_hh_income")# econ_savings
ggplot(data = cfpb_results) +
geom_point(mapping = aes(x=cfpb_score, y=econ_savings, color=econ_savings)) +
labs (title = "econ_savings")# per_health
ggplot(data = cfpb_results) +
geom_point(mapping = aes(x=cfpb_score, y=per_health, color=per_health)) +
labs (title = "per_health")# dem_age
ggplot(data = cfpb_results) +
geom_point(mapping = aes(x=cfpb_score, y=dem_age, color=dem_age)) +
labs (title = "dem_age")# fin_achieve_goals
ggplot(data = cfpb_results) +
geom_point(mapping = aes(x=cfpb_score, y=fin_achieve_goals, color=fin_achieve_goals)) +
labs (title = "fin_achieve_goals")# fin_saving_habit
ggplot(data = cfpb_results) +
geom_point(mapping = aes(x=cfpb_score, y=fin_saving_habit, color=fin_saving_habit)) +
labs (title = "fin_saving_habit")# fin_consult_budget
ggplot(data = cfpb_results) +
geom_point(mapping = aes(x=cfpb_score, y=fin_consult_budget, color=fin_consult_budget)) +
labs (title = "fin_consult_budget")# fin_set_goals
ggplot(data = cfpb_results) +
geom_point(mapping = aes(x=cfpb_score, y=fin_set_goals, color=fin_set_goals)) +
labs (title = "fin_set_goals")# cfpb
cfpb_results$.residual <- cfpb_results$cfpb_score - cfpb_results$.pred
ggplot(cfpb_results,
aes(.residual, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(.residual)), linetype = "dashed", color="red", size = 0.5) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Residual Histogram")summary(cfpb_df$cfpb_score)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 48.00 56.00 56.08 65.00 95.00
cfpb_results %>%
# filter(cfpb_score > 47 & cfpb_score < 66) %>%
ggplot(
aes(y = abs(.residual), x = .pred)) +
geom_point(alpha = 0.7, color="blue",size = 0.1) +
geom_smooth(method = lm) +
#geom_vline(aes(xintercept = mean(.residual)), linetype = "dashed", color="red", size = 0.5) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Residual (vs predictions)")## `geom_smooth()` using formula 'y ~ x'
summary(cfpb_df$cfpb_score)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 48.00 56.00 56.08 65.00 95.00
# econ_hh_income
ggplot(data = cfpb_results) +
geom_boxplot(mapping = aes(x=.residual, y=econ_hh_income, color=econ_hh_income)) +
labs (title = "econ_hh_income")# econ_savings
ggplot(data = cfpb_results) +
geom_boxplot(mapping = aes(x=.residual, y=econ_savings, color=econ_savings)) +
labs (title = "econ_savings")# per_health
ggplot(data = cfpb_results) +
geom_boxplot(mapping = aes(x=.residual, y=per_health, color=per_health)) +
labs (title = "per_health")# dem_age
ggplot(data = cfpb_results) +
geom_boxplot(mapping = aes(x=.residual, y=dem_age, color=dem_age)) +
labs (title = "dem_age")# fin_achieve_goals
ggplot(data = cfpb_results) +
geom_boxplot(mapping = aes(x=.residual, y=fin_achieve_goals, color=fin_achieve_goals)) +
labs (title = "fin_achieve_goals")# fin_saving_habit
ggplot(data = cfpb_results) +
geom_boxplot(mapping = aes(x=.residual, y=fin_saving_habit, color=fin_saving_habit)) +
labs (title = "fin_saving_habit")# fin_consult_budget
ggplot(data = cfpb_results) +
geom_boxplot(mapping = aes(x=.residual, y=fin_consult_budget, color=fin_consult_budget)) +
labs (title = "fin_consult_budget")# fin_set_goals
ggplot(data = cfpb_results) +
geom_boxplot(mapping = aes(x=.residual, y=fin_set_goals, color=fin_set_goals)) +
labs (title = "fin_set_goals")Demographic Model
The demographic model that includes race, gender and education has limited predictive power for overall cfpb score on its own It would be interesting to review the impacts of race, gender and education if we held other variables constant.
Model stats - r-squared for the model is 0.08288 - The residual analysis analysis shows several outliers that are impacting the model
# Step: AIC=22814.15
# cfpb_score ~ dem_edu + dem_race +dem_gender
# Model Specification
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
# Fitting to Trained Data
lm_fit <- lm_model %>%
fit(cfpb_score ~ dem_edu + dem_race + dem_gender
, data = cfpb_training)
# Explore Training Results
names(lm_fit)## [1] "lvl" "spec" "fit" "preproc" "elapsed"
summary(lm_fit$fit)##
## Call:
## stats::lm(formula = cfpb_score ~ dem_edu + dem_race + dem_gender,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.443 -8.084 -0.120 8.121 43.702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.3671 0.7956 63.310 < 2e-16 ***
## dem_eduHigh school degree/GED 4.7528 0.8220 5.782 7.81e-09 ***
## dem_eduSome college/Associate 5.5119 0.8098 6.807 1.11e-11 ***
## dem_eduBachelor's degree 9.7164 0.8496 11.436 < 2e-16 ***
## dem_eduGraduate/professional degree 12.6420 0.8722 14.495 < 2e-16 ***
## dem_raceBlack, Non-Hispanic -3.4910 0.6202 -5.629 1.91e-08 ***
## dem_raceOther, Non-Hispanic -3.4305 0.8265 -4.151 3.37e-05 ***
## dem_raceHispanic -3.1819 0.5671 -5.611 2.12e-08 ***
## dem_genderFemale -0.6401 0.3790 -1.689 0.0913 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 5101 degrees of freedom
## Multiple R-squared: 0.08432, Adjusted R-squared: 0.08288
## F-statistic: 58.72 on 8 and 5101 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_fit$fit, pch = 16, col = '#006EA1')# Data frame of estimated coefficients
tidy(lm_fit)## # A tibble: 9 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 50.4 0.796 63.3 0
## 2 dem_eduHigh school degree/GED 4.75 0.822 5.78 7.81e- 9
## 3 dem_eduSome college/Associate 5.51 0.810 6.81 1.11e-11
## 4 dem_eduBachelor's degree 9.72 0.850 11.4 6.36e-30
## 5 dem_eduGraduate/professional degree 12.6 0.872 14.5 1.09e-46
## 6 dem_raceBlack, Non-Hispanic -3.49 0.620 -5.63 1.91e- 8
## 7 dem_raceOther, Non-Hispanic -3.43 0.826 -4.15 3.37e- 5
## 8 dem_raceHispanic -3.18 0.567 -5.61 2.12e- 8
## 9 dem_genderFemale -0.640 0.379 -1.69 9.13e- 2
# Performance metrics on training data
glance(lm_fit) ## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0843 0.0829 13.4 58.7 4.49e-92 8 -20495. 41011. 41076.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# variable importance
vip(lm_fit, num_features=15)
Testing Demographic Model
#predict(lm_fit, new_data = cfpb_test)
cfpb_results <- predict(lm_fit, new_data = cfpb_test) %>%
bind_cols(cfpb_test)
#cfpb_results
# RMSE on test set
rmse(cfpb_results, truth = cfpb_score, estimate = .pred)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 13.9
rsq(cfpb_results, truth = cfpb_score, estimate = .pred)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.0749
ggplot(data = cfpb_results,
mapping = aes(x = .pred, y = cfpb_score)) +
geom_point(color = '#006EA1') +
geom_smooth(method = lm) +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - Demographic',
x = 'Predicted',
y = 'Actual')## `geom_smooth()` using formula 'y ~ x'
Financial Planning
The model that focused on financial planning, budgeting and setting goals has some explanatory power. Although 0.2717 is relativily low in comparison to the other models it is significant when considered in the context of the variable that are not included such as income, education and age.
Model stats - r-squared for the model is 0.2717 - The residual analysis analysis shows several outliers that are impacting the model
# Step: AIC=22814.15
# cfpb_score ~ fin_goals + fin_saving_habit + fin_frugal +
# fin_consult_budget + fin_follow_budget + fin_set_goals + fin_clear_plan
# Model Specification
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
# Fitting to Trained Data
lm_fit <- lm_model %>%
fit(cfpb_score ~ fin_goals + fin_saving_habit + fin_frugal + fin_consult_budget +
fin_follow_budget + fin_set_goals + fin_clear_plan
, data = cfpb_training)
# Explore Training Results
names(lm_fit)## [1] "lvl" "spec" "fit" "preproc" "elapsed"
summary(lm_fit$fit)##
## Call:
## stats::lm(formula = cfpb_score ~ fin_goals + fin_saving_habit +
## fin_frugal + fin_consult_budget + fin_follow_budget + fin_set_goals +
## fin_clear_plan, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.224 -7.087 -0.196 7.042 54.220
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 52.2624 4.9314 10.598
## fin_goalsNo 2.1419 1.6357 1.309
## fin_goalsYes 0.1991 1.6296 0.122
## fin_saving_habitStrongly disagree -7.6011 5.3079 -1.432
## fin_saving_habitDisagree -2.0373 5.2728 -0.386
## fin_saving_habitDisagree slightly 0.7912 5.2671 0.150
## fin_saving_habitAgree slightly 5.1534 5.2539 0.981
## fin_saving_habitAgree 10.3637 5.2577 1.971
## fin_saving_habitStrongly agree 15.1631 5.2560 2.885
## fin_frugalStrongly disagree -17.2394 6.8211 -2.527
## fin_frugalDisagree -12.8933 6.4805 -1.990
## fin_frugalDisagree slightly -12.2969 6.2551 -1.966
## fin_frugalAgree slightly -13.0762 6.2088 -2.106
## fin_frugalAgree -11.8171 6.2016 -1.906
## fin_frugalStrongly agree -13.3798 6.1965 -2.159
## fin_consult_budgetStrongly disagree 21.1514 7.6840 2.753
## fin_consult_budgetDisagree 16.6486 7.5975 2.191
## fin_consult_budgetNeither agree nor disagree 14.2768 7.5813 1.883
## fin_consult_budgetAgree 12.7164 7.5853 1.676
## fin_consult_budgetStrongly agree 10.3244 7.5639 1.365
## fin_follow_budgetStrongly disagree -2.7322 1.8653 -1.465
## fin_follow_budgetDisagree 0.9328 0.9581 0.974
## fin_follow_budgetNeither agree nor disagree 0.6871 0.7849 0.875
## fin_follow_budgetAgree 0.3698 0.6901 0.536
## fin_follow_budgetStrongly agree NA NA NA
## fin_set_goalsStrongly disagree -6.3217 1.7734 -3.565
## fin_set_goalsDisagree -5.7318 0.9861 -5.813
## fin_set_goalsNeither agree nor disagree -4.5447 0.7716 -5.890
## fin_set_goalsAgree -3.1480 0.6707 -4.694
## fin_set_goalsStrongly agree NA NA NA
## fin_clear_planStrongly disagree -4.7413 1.3481 -3.517
## fin_clear_planDisagree -2.9043 0.9594 -3.027
## fin_clear_planNeither agree nor disagree -2.8367 0.8794 -3.226
## fin_clear_planAgree -1.7227 0.8266 -2.084
## fin_clear_planStrongly agree NA NA NA
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## fin_goalsNo 0.190453
## fin_goalsYes 0.902768
## fin_saving_habitStrongly disagree 0.152193
## fin_saving_habitDisagree 0.699235
## fin_saving_habitDisagree slightly 0.880598
## fin_saving_habitAgree slightly 0.326701
## fin_saving_habitAgree 0.048764 *
## fin_saving_habitStrongly agree 0.003932 **
## fin_frugalStrongly disagree 0.011523 *
## fin_frugalDisagree 0.046694 *
## fin_frugalDisagree slightly 0.049363 *
## fin_frugalAgree slightly 0.035248 *
## fin_frugalAgree 0.056770 .
## fin_frugalStrongly agree 0.030878 *
## fin_consult_budgetStrongly disagree 0.005932 **
## fin_consult_budgetDisagree 0.028473 *
## fin_consult_budgetNeither agree nor disagree 0.059735 .
## fin_consult_budgetAgree 0.093713 .
## fin_consult_budgetStrongly agree 0.172326
## fin_follow_budgetStrongly disagree 0.143057
## fin_follow_budgetDisagree 0.330316
## fin_follow_budgetNeither agree nor disagree 0.381369
## fin_follow_budgetAgree 0.592045
## fin_follow_budgetStrongly agree NA
## fin_set_goalsStrongly disagree 0.000368 ***
## fin_set_goalsDisagree 6.52e-09 ***
## fin_set_goalsNeither agree nor disagree 4.11e-09 ***
## fin_set_goalsAgree 2.75e-06 ***
## fin_set_goalsStrongly agree NA
## fin_clear_planStrongly disagree 0.000440 ***
## fin_clear_planDisagree 0.002480 **
## fin_clear_planNeither agree nor disagree 0.001265 **
## fin_clear_planAgree 0.037212 *
## fin_clear_planStrongly agree NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.91 on 5078 degrees of freedom
## Multiple R-squared: 0.2761, Adjusted R-squared: 0.2717
## F-statistic: 62.47 on 31 and 5078 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_fit$fit, pch = 16, col = '#006EA1')# Data frame of estimated coefficients
tidy(lm_fit)## # A tibble: 35 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 52.3 4.93 10.6 5.69e-26
## 2 fin_goalsNo 2.14 1.64 1.31 1.90e- 1
## 3 fin_goalsYes 0.199 1.63 0.122 9.03e- 1
## 4 fin_saving_habitStrongly disagree -7.60 5.31 -1.43 1.52e- 1
## 5 fin_saving_habitDisagree -2.04 5.27 -0.386 6.99e- 1
## 6 fin_saving_habitDisagree slightly 0.791 5.27 0.150 8.81e- 1
## 7 fin_saving_habitAgree slightly 5.15 5.25 0.981 3.27e- 1
## 8 fin_saving_habitAgree 10.4 5.26 1.97 4.88e- 2
## 9 fin_saving_habitStrongly agree 15.2 5.26 2.88 3.93e- 3
## 10 fin_frugalStrongly disagree -17.2 6.82 -2.53 1.15e- 2
## # … with 25 more rows
# Performance metrics on training data
glance(lm_fit) ## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.276 0.272 11.9 62.5 0 31 -19895. 39856. 40072.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# variable importance
vip(lm_fit, num_features=15)
Testing Financial Planning Model
#predict(lm_fit, new_data = cfpb_test)
cfpb_results <- predict(lm_fit, new_data = cfpb_test) %>%
bind_cols(cfpb_test)## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response"): prediction from a rank-deficient fit may be misleading
#cfpb_results
# RMSE on test set
rmse(cfpb_results, truth = cfpb_score, estimate = .pred)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 12.4
rsq(cfpb_results, truth = cfpb_score, estimate = .pred)## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.274
ggplot(data = cfpb_results,
mapping = aes(x = .pred, y = cfpb_score)) +
geom_point(color = '#006EA1') +
geom_smooth(method = lm) +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - CFPB Score',
x = 'Predicted',
y = 'Actual')## `geom_smooth()` using formula 'y ~ x'
# age category
cfpb_results$cfpb_score_10cat <- cut(cfpb_results$cfpb_score, breaks = c(-10, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
labels = c("< 10","10-20","20-30","30-40","40-50","50-60","60-70","70-80","80-90","90-100"),
right = FALSE,
include.lowest=TRUE)
t <- cfpb_results %>% select(cfpb_score,cfpb_score_10cat)
ggplot(cfpb_results,
aes(.pred, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(.pred)), linetype = "dashed", color="red", size = 0.5) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (.pred)")summary(cfpb_df$.pred)## Length Class Mode
## 0 NULL NULL
ggplot(data = cfpb_results) +
geom_boxplot(mapping = aes(x=.pred, y=cfpb_score_10cat, color=cfpb_score_10cat)) +
labs (title = "cfpb_score_10cat")ggplot(cfpb_results,
aes(.pred, fill = cfpb_score_10cat, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(.pred)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by cfpb_score_10cat)")Part 5 - Conclusion
A linear regression model was developed using the results of the Consumer Financial Protection Bureau’s (CFPB) National Financial Well-Being Survey Public Use File (PUF). The model predicts over 0.5608 cfpb score given the 8 variable selected for the optimized model.
Model Performance - adjusted \(R^2\) 0.5608655
- root mean square error 9.609951
Model Variables The following variable are included in the model as significant contributors to the overall model accuracy.
- econ_hh_income (PPINCIMP) - The survey respondents Household Income for a given year.
- econ_savings (SAVINGSRANGES) - An assessment of the amount of savings that the survey respondant currenlty has.
- per_health (HEALTH) - An assessment of the survey respondents overall health.
- dem_age (agecat) - The age of the survey respondent (8 categories).
- fin_achieve_goals (GOALCONF) - An assessment of the ability of the survey respondent to achieve their financial goals.
- fin_saving_habit (SAVEHABIT) - An assessment of savings habits and how often money is put away.
- fin_consult_budget (PROPPLAN_1) - An assessment of how budgeting impacts spending decisions.
- fin_set_goals (PROPPLAN_3) - An assessment of financial planning and goal setting discipline.
The findings for the analysis are not surprising and are consistent with several studies in the industry however deeper analysis is warranted. The cfpb scaled score is designed to cover the entire population however it is not surprising that the resulting model breaks down slightly at the higher and lower ranges of the cfpb score. A more localized analysis of cfpb score ranges might provide some additional insight into the factors that impact consumer financial wellness.
References
There are numerous studies of financial wellness sponsored by financial firms, industry groups and government agencies. I choose to focus on two studies:
- Consumer Financial Protection Bureau - Financial Wellness Survey Financial well-being data The PUF survey results can be accessed as a csv file Financial well-being survey data
Appendix (optional)
CFPB Financial Well-Being Scale Methodology
The CFPB Financial Well-Being Scale methodology uses the following 10 questions to calculate the cfpb score. There is some overlap between the questions used to calculate the cfpb score and the survey questions. The optimized models presented in this analysis did not include any of the variable that overlap with the scoring questions.
How well does this statement describe you or your situation?
- I could handle a major unexpected expense
- I am securing my financial future
- Because of my money situation, I feel like I will never have the things I want in life*
- I can enjoy life because of the way I’m managing my money
- I am just getting by financially*
- I am concerned that the money I have or will save won’t last*
How often does this statement apply to you?
- Giving a gift for a wedding, birthday or other occasion would put a strain on my finances for the month*
- I have money left over at the end of the month
- I am behind with my finances*
- My finances control my life*
The research question is clearly stated, can be answered by the data, and the context of the problem clearly explained.
Review of the variables
The PUF includes financial attitude that were not included as part of the analysis but provide a interesting window into the relationship of financial attitudes to consumer financial wellbeing.
result_df <- getResultsCFPBFile()
result_df <- result_df %>% filter( cfpb_score > 0)# life_satisfied
ggplot(result_df, aes(x=life_satisfied)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "life_satisfied")ggplot(data = result_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=life_satisfied, color=life_satisfied)) +
labs (title = "life_satisfied")ggplot(result_df,
aes(cfpb_score, fill = life_satisfied, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by life_satisfied)")# life_optimistic
ggplot(result_df, aes(x=life_optimistic)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "life_optimistic")ggplot(data = result_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=life_optimistic, color=life_optimistic)) +
labs (title = "life_optimistic")ggplot(result_df,
aes(cfpb_score, fill = life_optimistic, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by life_optimistic)")# material_admire_wealth
ggplot(result_df, aes(x=material_admire_wealth)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "material_admire_wealth")ggplot(data = result_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=material_admire_wealth, color=material_admire_wealth)) +
labs (title = "material_admire_wealth")ggplot(result_df,
aes(cfpb_score, fill = material_admire_wealth, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by material_admire_wealth)")# material_define_worth
ggplot(result_df, aes(x=material_define_worth)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "material_define_worth")ggplot(data = result_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=material_define_worth, color=material_define_worth)) +
labs (title = "material_define_worth")ggplot(result_df,
aes(cfpb_score, fill = material_define_worth, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by material_define_worth)")# material_impress
ggplot(result_df, aes(x=material_impress)) +
geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
coord_flip() +
labs(title = "material_impress")ggplot(data = result_df) +
geom_boxplot(mapping = aes(x=cfpb_score, y=material_impress, color=material_impress)) +
labs (title = "material_impress")ggplot(result_df,
aes(cfpb_score, fill = material_impress, y = stat(density))) +
geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
scale_fill_brewer(palette="Spectral") +
labs(title = "CFPB Score (by material_impress)")