Setup

# Exported data load, tidy function and tibble cleaning
fun_file_name <- glue(getwd(), "/Project/DATA606_Functions.R")
source(fun_file_name, local = knitr::knit_global())

Abstract

The deregulation of the financial services industry has provided more investment opportunities for individuals and families but it has also increased the level of financial knowledge necessary to make good decisions. It is no coincidence that these trends have driven an increased focus in financial health and financial wellbeing by individuals, employers, government agencies and financial services firms. Understanding financial wellness requires a clear definition of wellness, an approach to measuring differences in wellbeing, and an understanding the factors that influence consumer financial wellness.

This project adopts the definition and measurement methodology developed by the Consumer Financial Protection Bureau as part of their ongoing research efforts. The CFPB defines financial wellness as the “…sense of financial security and freedom of choice—both in the present and when considering the future”. The CFPB has developed a measurement of consumer financial wellbeing called the Financial Wellbeing Scale (CFPB Score). The CFPB methodology includes a survey that not only captures the 10 questions necessary to calculate the CFPB score but includes additional attributes that paint a holistic financial and demographic picture of an individual or family.

This analysis will utilize the data from the CFPB Survey. Starting with the non calculated variables linear regression will be used to evaluate the relationship between non scoring attributes demographic, financial, house hold structure characteristics and the CFPB Score.

(Please note that the 10 questions that are used to calculate the CFPB Score are not included in the attributes evaluated in this analysis)

Part 1 - Introduction

This analysis is focused on understanding the factors that impact consumer financial wellbeing. There are numerous sources of analysis, theory and survey data however for the purposes of this project I will focus on the work from the Consumer Financial Protection Bureau National Financial Well-Being Survey Public Use File (PUF). The PUF survey data allows us to analyze a CFPB Score of consumer wellbeing in association with an individuals financial and demographic characteristics. Linear regression will be used to quantify the impact of individual factors and create a predictive model of financial wellbeing.

Analysis Steps In addition to the outline provided for the project that analysis will follow that steps below.

Review the survey data including the cfpb score
Select subset of 217 variables in the survey to analyze
Explore the relationship between individual variables and the cfpb score
Fit the full model using linear regression
Optimize the model using stepwize regression
Fit a models focused on financial planning variables and demographic variables respectively
Explore impacts of survey variables on cfpb score

Part 2 - Data

The data was collected as part of the Consumer Financial Protection Bureau’s (CFPB) National Financial Well-Being Survey Public Use File (PUF). The PUF is a dataset containing

data collected in the National Financial Well-Being Survey,
data about members of the GfK KnowledgePanel collected prior to the survey, and
data on poverty levels in respondents’ counties of residence.

The National Financial Well-Being Survey was conducted in English and Spanish via web mode between October 27, 2016 and December 5, 2016. Overall, 6,394 surveys were completed: 5,395 from the general population sample and 999 from an oversample of adults aged 62 and older. The survey was designed to represent the adult population of the 50 U.S. states and the District of Columbia. The survey was fielded on the GfK KnowledgePanel®. The KnowledgePanel sample is recruited using address-based sampling and dual-frame landline and cell phone random digit dialing methods.

The PUF was published in 2017.

Load data and review summary stats

Limited filtering of observations conducted as part of this analysis Other answers such as “refuse to answer” or “do not know” actual contain some relevant information in the context of a individuals financial characteristics. This can be seen in the prevalence of these answers in the dataset roughly 2/3 or the observations include at least one survey answer in the other category.

# full dataset
cfpbRaw_df <- getRawCFPBFile()


# load data - logic for data tidying is associated DATA606_Functions.R file
cfpb_df <- getReducedCFPBFile()
cfpb_df <- cfpb_df %>% droplevels()


# Filter records with invalid cfpb scores
cfpb_df <- cfpb_df %>% filter( cfpb_score > 0)

summary(cfpb_df)

##    cfpb_score                 econ_hh_income              econ_inc_volatility
##  Min.   :14.00   $100,000 to $149,999:1114   Refused                :  61    
##  1st Qu.:48.00   $75,000 to $99,999  : 954   Roughly the same       :4611    
##  Median :56.00   $150,000 or more    : 861   some unusually spending:1303    
##  Mean   :56.08   Less than $20,000   : 717   varies quite a bit     : 414    
##  3rd Qu.:65.00   $60,000 to $74,999  : 651                                   
##  Max.   :95.00   $30,000 to $39,999  : 614                                   
##                  (Other)             :1478                                   
##       econ_hh_earners       econ_emp_status            econ_savings 
##  Refused      :  75   full-time     :2491   $5,000-19,999    :1093  
##  One          :2440   Retired       :1837   Prefer not to say: 987  
##  Two          :3532   part-time     : 432   $1,000-4,999     : 931  
##  More than two: 342   Self-employed : 418   $20,000-74,999   : 845  
##                       Homemaker     : 358   $75,000 or more  : 834  
##                       unable to work: 269   $100-999         : 682  
##                       (Other)       : 584   (Other)          :1017  
##                           hh_status                  hh_value   
##  Refused                       :  39   Question not asked:2226  
##  I own my home                 :4163   $150,000-249,999  :1053  
##  I rent                        :1642   Less than $150,000:1034  
##  I do not currently own or rent: 545   $250,000-399,999  : 991  
##                                        $400,000 or more  : 804  
##                                        Prefer not to say : 159  
##                                        (Other)           : 122  
##              hh_mortgage       per_health  
##  Question not asked:2226   Refused  :  52  
##  Refused           :  56   Poor     : 154  
##  Less than $50,000 :1512   Fair     : 780  
##  $50,000-199,999   :1360   Good     :2137  
##  $200,000 or more  : 622   Very good:2582  
##  I don't know      : 115   Excellent: 684  
##  Prefer not to say : 498                   
##                                hh_arrange   hh_size  
##  Refused                            :  28   1 :1227  
##  only adult in the household        :1128   2 :2702  
##  spouse/partner/significant other   :4204   3 : 996  
##  my parents' home                   : 450   4 : 817  
##  other family, friends, or roommates: 453   5+: 647  
##  Some other arrangement             : 126            
##                                                      
##               dem_marital      dem_age        dem_generation
##  Married            :3825   25-34  :1113   Pre-Boomer:1112  
##  Widowed            : 360   45-54  :1074   Boomer    :2252  
##  Divorced/Separated : 687   62-69  :1021   Gen X     :1429  
##  Never married      :1149   35-44  : 828   Millennial:1596  
##  Living with partner: 368   75+    : 736                    
##                             55-61  : 707                    
##                             (Other): 910                    
##                          dem_edu                    dem_race     dem_gender  
##  Less than high school       : 429   White, Non-Hispanic:4495   Male  :3350  
##  High school degree/GED      :1620   Black, Non-Hispanic: 684   Female:3039  
##  Some college/Associate      :1931   Other, Non-Hispanic: 336                
##  Bachelor's degree           :1312   Hispanic           : 874                
##  Graduate/professional degree:1097                                           
##                                                                              
##                                                                              
##             fin_achieve_goals   fin_goals             fin_saving_habit
##  Refused             :  25    Refused:  74   Refused          :  16   
##  Not at all confident: 128    No     :2268   Strongly disagree: 297   
##  Not very confident  : 735    Yes    :4047   Disagree         : 610   
##  Somewhat confident  :3034                   Disagree slightly: 659   
##  Very confident      :2467                   Agree slightly   :1329   
##                                              Agree            :1728   
##                                              Strongly agree   :1750   
##              fin_frugal                    fin_consult_budget
##  Refused          :  12   Refused                   :   6    
##  Strongly disagree:  22   Strongly disagree         : 200    
##  Disagree         :  50   Disagree                  : 727    
##  Disagree slightly: 195   Neither agree nor disagree:1330    
##  Agree slightly   : 805   Agree                     :2776    
##  Agree            :2459   Strongly agree            :1350    
##  Strongly agree   :2846                                      
##                   fin_follow_budget                    fin_set_goals 
##  Refused                   :   6    Refused                   :   6  
##  Strongly disagree         : 132    Strongly disagree         : 127  
##  Disagree                  : 653    Disagree                  : 528  
##  Neither agree nor disagree:1629    Neither agree nor disagree:1625  
##  Agree                     :2985    Agree                     :3088  
##  Strongly agree            : 984    Strongly agree            :1015  
##                                                                      
##                     fin_clear_plan
##  Refused                   :   6  
##  Strongly disagree         : 265  
##  Disagree                  :1197  
##  Neither agree nor disagree:2208  
##  Agree                     :2034  
##  Strongly agree            : 679  
##

The cfpb score for this survey data is near normaly distributed with outlines at the high end and low end of the scoring distribution.

# cfpb
ggplot(cfpb_df,
    aes(cfpb_score, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", color="red", size = 0.5) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (cfpb dataset)")

summary(cfpb_df$cfpb_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   48.00   56.00   56.08   65.00   95.00

House Hold Income

As would be expected the histogram and the box plot for cfpb score categorized by house hold income shows some positive correlation between house hold income and the cfpb_score.

# econ_hh_income 
ggplot(cfpb_df, aes(x=econ_hh_income)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "econ_hh_income")

ggplot(data = cfpb_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=econ_hh_income, color=econ_hh_income)) +
  labs (title="CFPB Score ( by econ_hh_income)", x="CFPB score", y="house hold income")

ggplot(cfpb_df,
    aes(cfpb_score, fill = econ_hh_income, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by econ_hh_income)")

Age

As would be expected the histogram and the box plot for cfpb score categorized by age shows some positive correlation between house hold income and the cfpb_score. This finding aligns with the typical relationship between networth, income and age.

# dem_age 
ggplot(cfpb_df, aes(x=dem_age)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "dem_age")

ggplot(data = cfpb_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=dem_age, color=dem_age)) +
  labs (title = "CFPB Score (by dem_age)", x="CFPB score", y="age category")

ggplot(cfpb_df,
    aes(cfpb_score, fill = dem_age, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by dem_age)")

Achieve Financial Goals

The relationship between sfpb_score and confidence in reaching your financial goals is intriguing It highlights a positive relationship between confidence in financial goals and cfpb score. The box plot and histogram show a wide range of survey respondents are optimistic even if their cfpb scores would indicate otherwise.

# fin_achieve_goals
ggplot(cfpb_df, aes(x=fin_achieve_goals)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "fin_achieve_goals")

ggplot(data = cfpb_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=fin_achieve_goals, color=fin_achieve_goals)) +
  labs (title = "CFPB Score (by fin_achieve_goals)", x="CFPB score", y="belief in achieving fin goals")

ggplot(cfpb_df,
    aes(cfpb_score, fill = fin_achieve_goals, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by fin_achieve_goals)")

Health

Given the structure of the health care system in the United States it is unsurprising that health shows a positive correlated cfpb score.

# per_health
ggplot(cfpb_df, aes(x=fin_frugal)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "per_health")

ggplot(data = cfpb_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=per_health, color=per_health)) +
  labs (title = "CFPB Score (by per_health)", x="CFPB score", y="personal health evaluation")

ggplot(cfpb_df,
    aes(cfpb_score, fill = per_health, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by per_health)")

Part 3 - Exploratory data analysis

This analysis uses linear regression to identify and quantify the impacts of survey variables to the cfpb score.

Preconditions for Linear Models

The residual analysis from the CFPB model and Federal Reserve model roughly adheres to the following pre-conditions for linear modeling. More analysis would be required to ensure independence and a linear relationship to the dependent variable. It would be difficult to assert full independence of the predictor variables given that each case reflect an individual. Thre are numerous studies that link demographics, race, geography and age to economic outcomes.

residuals are nearly normal
residuals have constant variability
residuals are independent
each variable is linearly related to the outcome

Fit Full Model

The full model of all 26 variables included in the analysis is used as a starting point. The resulting model has an adjusted \(R^2\) of 0.5739.

# split data
cfpb_split <- initial_split(cfpb_df, prop = 0.8, strata = cfpb_score)
cfpb_training <-  training(cfpb_split)
cfpb_test <-  testing(cfpb_split)



# fit full model
cfpb_result1<-lm(cfpb_score ~ . ,data=cfpb_training)
summary(cfpb_result1)

## 
## Call:
## lm(formula = cfpb_score ~ ., data = cfpb_training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.071  -5.526  -0.130   5.542  35.792 
## 
## Coefficients: (5 not defined because of singularities)
##                                                 Estimate Std. Error t value
## (Intercept)                                    5.430e+01  4.508e+00  12.046
## econ_hh_income$20,000 to $29,999               2.647e-01  6.267e-01   0.422
## econ_hh_income$30,000 to $39,999               6.645e-01  6.056e-01   1.097
## econ_hh_income$40,000 to $49,999               2.232e+00  6.621e-01   3.372
## econ_hh_income$50,000 to $59,999               2.689e+00  6.640e-01   4.050
## econ_hh_income$60,000 to $74,999               2.700e+00  6.336e-01   4.261
## econ_hh_income$75,000 to $99,999               3.109e+00  6.127e-01   5.075
## econ_hh_income$100,000 to $149,999             3.848e+00  6.282e-01   6.127
## econ_hh_income$150,000 or more                 5.365e+00  6.901e-01   7.774
## econ_inc_volatilityRoughly the same           -2.284e-01  1.665e+00  -0.137
## econ_inc_volatilitysome unusually spending    -1.276e+00  1.680e+00  -0.759
## econ_inc_volatilityvaries quite a bit         -1.945e+00  1.737e+00  -1.120
## econ_hh_earnersOne                             5.485e-01  1.503e+00   0.365
## econ_hh_earnersTwo                             5.072e-01  1.509e+00   0.336
## econ_hh_earnersMore than two                   1.597e+00  1.608e+00   0.994
## econ_emp_statusfull-time                      -3.769e-02  5.675e-01  -0.066
## econ_emp_statuspart-time                      -3.900e-01  7.182e-01  -0.543
## econ_emp_statusHomemaker                       8.975e-01  7.894e-01   1.137
## econ_emp_statusFull-time student               6.551e-02  9.580e-01   0.068
## econ_emp_statusunable to work                 -8.031e-01  8.810e-01  -0.912
## econ_emp_statusUnemployed                     -1.983e+00  8.614e-01  -2.302
## econ_emp_statusRetired                         1.462e+00  6.576e-01   2.223
## econ_emp_statusRefused                        -5.139e-01  1.159e+00  -0.443
## econ_savings0                                 -2.672e+00  2.006e+00  -1.332
## econ_savings$1-99                             -3.720e+00  2.007e+00  -1.853
## econ_savings$100-999                          -1.831e+00  1.966e+00  -0.931
## econ_savings$1,000-4,999                       1.334e+00  1.951e+00   0.684
## econ_savings$5,000-19,999                      3.869e+00  1.946e+00   1.988
## econ_savings$20,000-74,999                     5.779e+00  1.952e+00   2.960
## econ_savings$75,000 or more                    7.855e+00  1.961e+00   4.006
## econ_savingsI don't know                       1.483e+00  2.082e+00   0.712
## econ_savingsPrefer not to say                  3.630e+00  1.940e+00   1.871
## hh_statusI own my home                        -2.787e-01  2.714e+00  -0.103
## hh_statusI rent                                1.863e-02  2.585e+00   0.007
## hh_statusI do not currently own or rent       -2.610e-01  2.620e+00  -0.100
## hh_valueRefused                                3.146e+00  3.880e+00   0.811
## hh_valueLess than $150,000                     1.238e+00  9.765e-01   1.268
## hh_value$150,000-249,999                       1.349e+00  9.716e-01   1.388
## hh_value$250,000-399,999                       2.316e+00  9.804e-01   2.362
## hh_value$400,000 or more                       2.291e+00  1.015e+00   2.258
## hh_valueI don't know                           3.349e-01  1.304e+00   0.257
## hh_valuePrefer not to say                             NA         NA      NA
## hh_mortgageRefused                             2.391e+00  1.454e+00   1.644
## hh_mortgageLess than $50,000                   3.277e-01  6.021e-01   0.544
## hh_mortgage$50,000-199,999                    -1.046e+00  6.322e-01  -1.655
## hh_mortgage$200,000 or more                   -2.131e+00  7.451e-01  -2.859
## hh_mortgageI don't know                       -1.158e+00  1.114e+00  -1.039
## hh_mortgagePrefer not to say                          NA         NA      NA
## per_healthPoor                                -4.545e+00  1.829e+00  -2.485
## per_healthFair                                -4.145e+00  1.638e+00  -2.531
## per_healthGood                                -2.170e+00  1.607e+00  -1.350
## per_healthVery good                           -4.307e-01  1.603e+00  -0.269
## per_healthExcellent                            5.609e-01  1.638e+00   0.342
## hh_arrangeonly adult in the household         -6.562e-01  3.240e+00  -0.203
## hh_arrangespouse/partner/significant other     2.313e-01  3.218e+00   0.072
## hh_arrangemy parents' home                    -2.678e-01  3.275e+00  -0.082
## hh_arrangeother family, friends, or roommates -8.478e-01  3.261e+00  -0.260
## hh_arrangeSome other arrangement               1.794e-01  3.330e+00   0.054
## hh_size2                                      -4.275e-01  4.668e-01  -0.916
## hh_size3                                      -9.878e-01  5.403e-01  -1.828
## hh_size4                                      -9.546e-01  5.823e-01  -1.639
## hh_size5+                                     -8.636e-01  6.215e-01  -1.389
## dem_maritalWidowed                            -3.184e-01  7.782e-01  -0.409
## dem_maritalDivorced/Separated                 -5.844e-01  5.904e-01  -0.990
## dem_maritalNever married                       6.354e-01  6.343e-01   1.002
## dem_maritalLiving with partner                -1.239e-01  6.142e-01  -0.202
## dem_age25-34                                   2.128e-01  6.942e-01   0.306
## dem_age35-44                                  -2.319e-04  1.398e+00   0.000
## dem_age45-54                                  -7.202e-01  1.508e+00  -0.478
## dem_age55-61                                  -4.314e-01  1.646e+00  -0.262
## dem_age62-69                                   2.767e+00  1.661e+00   1.666
## dem_age70-74                                   4.343e+00  1.913e+00   2.270
## dem_age75+                                     5.188e+00  2.029e+00   2.557
## dem_generationBoomer                           1.109e+00  1.077e+00   1.029
## dem_generationGen X                           -3.163e-01  1.260e+00  -0.251
## dem_generationMillennial                      -8.569e-01  1.794e+00  -0.478
## dem_eduHigh school degree/GED                 -6.899e-01  5.844e-01  -1.180
## dem_eduSome college/Associate                 -1.285e+00  5.968e-01  -2.153
## dem_eduBachelor's degree                      -1.553e+00  6.520e-01  -2.381
## dem_eduGraduate/professional degree           -1.292e+00  6.778e-01  -1.906
## dem_raceBlack, Non-Hispanic                    9.149e-01  4.522e-01   2.023
## dem_raceOther, Non-Hispanic                   -9.987e-01  5.780e-01  -1.728
## dem_raceHispanic                               1.004e+00  4.135e-01   2.429
## dem_genderFemale                              -6.294e-02  2.771e-01  -0.227
## fin_achieve_goalsNot at all confident         -1.194e+01  2.411e+00  -4.952
## fin_achieve_goalsNot very confident           -6.775e+00  2.241e+00  -3.023
## fin_achieve_goalsSomewhat confident           -1.739e+00  2.221e+00  -0.783
## fin_achieve_goalsVery confident                4.316e+00  2.233e+00   1.933
## fin_goalsNo                                    6.817e-01  1.286e+00   0.530
## fin_goalsYes                                  -3.886e-01  1.283e+00  -0.303
## fin_saving_habitStrongly disagree              3.275e+00  4.142e+00   0.791
## fin_saving_habitDisagree                       5.988e+00  4.110e+00   1.457
## fin_saving_habitDisagree slightly              7.041e+00  4.104e+00   1.715
## fin_saving_habitAgree slightly                 8.211e+00  4.093e+00   2.006
## fin_saving_habitAgree                          9.440e+00  4.094e+00   2.306
## fin_saving_habitStrongly agree                 1.203e+01  4.093e+00   2.938
## fin_frugalStrongly disagree                   -1.544e+01  5.340e+00  -2.890
## fin_frugalDisagree                            -1.400e+01  5.079e+00  -2.758
## fin_frugalDisagree slightly                   -1.429e+01  4.907e+00  -2.912
## fin_frugalAgree slightly                      -1.489e+01  4.871e+00  -3.056
## fin_frugalAgree                               -1.553e+01  4.867e+00  -3.190
## fin_frugalStrongly agree                      -1.660e+01  4.863e+00  -3.415
## fin_consult_budgetStrongly disagree            1.086e+01  6.233e+00   1.742
## fin_consult_budgetDisagree                     8.170e+00  6.168e+00   1.324
## fin_consult_budgetNeither agree nor disagree   6.686e+00  6.153e+00   1.087
## fin_consult_budgetAgree                        6.085e+00  6.158e+00   0.988
## fin_consult_budgetStrongly agree               5.439e+00  6.141e+00   0.886
## fin_follow_budgetStrongly disagree            -4.406e-01  1.437e+00  -0.307
## fin_follow_budgetDisagree                      7.144e-01  7.423e-01   0.962
## fin_follow_budgetNeither agree nor disagree    6.417e-01  6.077e-01   1.056
## fin_follow_budgetAgree                        -1.450e-02  5.330e-01  -0.027
## fin_follow_budgetStrongly agree                       NA         NA      NA
## fin_set_goalsStrongly disagree                -3.289e+00  1.382e+00  -2.380
## fin_set_goalsDisagree                         -1.921e+00  7.674e-01  -2.503
## fin_set_goalsNeither agree nor disagree       -1.767e+00  6.022e-01  -2.934
## fin_set_goalsAgree                            -2.159e+00  5.196e-01  -4.156
## fin_set_goalsStrongly agree                           NA         NA      NA
## fin_clear_planStrongly disagree               -1.262e+00  1.046e+00  -1.207
## fin_clear_planDisagree                        -4.698e-01  7.455e-01  -0.630
## fin_clear_planNeither agree nor disagree      -8.492e-01  6.827e-01  -1.244
## fin_clear_planAgree                           -6.931e-01  6.379e-01  -1.086
## fin_clear_planStrongly agree                          NA         NA      NA
##                                               Pr(>|t|)    
## (Intercept)                                    < 2e-16 ***
## econ_hh_income$20,000 to $29,999              0.672719    
## econ_hh_income$30,000 to $39,999              0.272631    
## econ_hh_income$40,000 to $49,999              0.000752 ***
## econ_hh_income$50,000 to $59,999              5.20e-05 ***
## econ_hh_income$60,000 to $74,999              2.07e-05 ***
## econ_hh_income$75,000 to $99,999              4.02e-07 ***
## econ_hh_income$100,000 to $149,999            9.67e-10 ***
## econ_hh_income$150,000 or more                9.19e-15 ***
## econ_inc_volatilityRoughly the same           0.890918    
## econ_inc_volatilitysome unusually spending    0.447627    
## econ_inc_volatilityvaries quite a bit         0.262777    
## econ_hh_earnersOne                            0.715171    
## econ_hh_earnersTwo                            0.736794    
## econ_hh_earnersMore than two                  0.320490    
## econ_emp_statusfull-time                      0.947052    
## econ_emp_statuspart-time                      0.587158    
## econ_emp_statusHomemaker                      0.255619    
## econ_emp_statusFull-time student              0.945485    
## econ_emp_statusunable to work                 0.362001    
## econ_emp_statusUnemployed                     0.021395 *  
## econ_emp_statusRetired                        0.026229 *  
## econ_emp_statusRefused                        0.657612    
## econ_savings0                                 0.182853    
## econ_savings$1-99                             0.063913 .  
## econ_savings$100-999                          0.351856    
## econ_savings$1,000-4,999                      0.494203    
## econ_savings$5,000-19,999                     0.046920 *  
## econ_savings$20,000-74,999                    0.003088 ** 
## econ_savings$75,000 or more                   6.26e-05 ***
## econ_savingsI don't know                      0.476241    
## econ_savingsPrefer not to say                 0.061355 .  
## hh_statusI own my home                        0.918191    
## hh_statusI rent                               0.994249    
## hh_statusI do not currently own or rent       0.920638    
## hh_valueRefused                               0.417472    
## hh_valueLess than $150,000                    0.204933    
## hh_value$150,000-249,999                      0.165212    
## hh_value$250,000-399,999                      0.018213 *  
## hh_value$400,000 or more                      0.023974 *  
## hh_valueI don't know                          0.797270    
## hh_valuePrefer not to say                           NA    
## hh_mortgageRefused                            0.100214    
## hh_mortgageLess than $50,000                  0.586221    
## hh_mortgage$50,000-199,999                    0.098026 .  
## hh_mortgage$200,000 or more                   0.004262 ** 
## hh_mortgageI don't know                       0.298779    
## hh_mortgagePrefer not to say                        NA    
## per_healthPoor                                0.013001 *  
## per_healthFair                                0.011410 *  
## per_healthGood                                0.176926    
## per_healthVery good                           0.788116    
## per_healthExcellent                           0.731991    
## hh_arrangeonly adult in the household         0.839491    
## hh_arrangespouse/partner/significant other    0.942704    
## hh_arrangemy parents' home                    0.934829    
## hh_arrangeother family, friends, or roommates 0.794903    
## hh_arrangeSome other arrangement              0.957031    
## hh_size2                                      0.359733    
## hh_size3                                      0.067547 .  
## hh_size4                                      0.101241    
## hh_size5+                                     0.164760    
## dem_maritalWidowed                            0.682487    
## dem_maritalDivorced/Separated                 0.322359    
## dem_maritalNever married                      0.316521    
## dem_maritalLiving with partner                0.840099    
## dem_age25-34                                  0.759238    
## dem_age35-44                                  0.999868    
## dem_age45-54                                  0.632898    
## dem_age55-61                                  0.793279    
## dem_age62-69                                  0.095749 .  
## dem_age70-74                                  0.023240 *  
## dem_age75+                                    0.010580 *  
## dem_generationBoomer                          0.303324    
## dem_generationGen X                           0.801834    
## dem_generationMillennial                      0.633020    
## dem_eduHigh school degree/GED                 0.237891    
## dem_eduSome college/Associate                 0.031381 *  
## dem_eduBachelor's degree                      0.017288 *  
## dem_eduGraduate/professional degree           0.056695 .  
## dem_raceBlack, Non-Hispanic                   0.043109 *  
## dem_raceOther, Non-Hispanic                   0.084098 .  
## dem_raceHispanic                              0.015164 *  
## dem_genderFemale                              0.820338    
## fin_achieve_goalsNot at all confident         7.60e-07 ***
## fin_achieve_goalsNot very confident           0.002515 ** 
## fin_achieve_goalsSomewhat confident           0.433602    
## fin_achieve_goalsVery confident               0.053264 .  
## fin_goalsNo                                   0.595959    
## fin_goalsYes                                  0.761941    
## fin_saving_habitStrongly disagree             0.429257    
## fin_saving_habitDisagree                      0.145158    
## fin_saving_habitDisagree slightly             0.086334 .  
## fin_saving_habitAgree slightly                0.044880 *  
## fin_saving_habitAgree                         0.021163 *  
## fin_saving_habitStrongly agree                0.003316 ** 
## fin_frugalStrongly disagree                   0.003864 ** 
## fin_frugalDisagree                            0.005844 ** 
## fin_frugalDisagree slightly                   0.003605 ** 
## fin_frugalAgree slightly                      0.002254 ** 
## fin_frugalAgree                               0.001430 ** 
## fin_frugalStrongly agree                      0.000644 ***
## fin_consult_budgetStrongly disagree           0.081513 .  
## fin_consult_budgetDisagree                    0.185408    
## fin_consult_budgetNeither agree nor disagree  0.277250    
## fin_consult_budgetAgree                       0.323124    
## fin_consult_budgetStrongly agree              0.375875    
## fin_follow_budgetStrongly disagree            0.759126    
## fin_follow_budgetDisagree                     0.335858    
## fin_follow_budgetNeither agree nor disagree   0.291068    
## fin_follow_budgetAgree                        0.978294    
## fin_follow_budgetStrongly agree                     NA    
## fin_set_goalsStrongly disagree                0.017368 *  
## fin_set_goalsDisagree                         0.012343 *  
## fin_set_goalsNeither agree nor disagree       0.003361 ** 
## fin_set_goalsAgree                            3.30e-05 ***
## fin_set_goalsStrongly agree                         NA    
## fin_clear_planStrongly disagree               0.227406    
## fin_clear_planDisagree                        0.528636    
## fin_clear_planNeither agree nor disagree      0.213578    
## fin_clear_planAgree                           0.277313    
## fin_clear_planStrongly agree                        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.111 on 4993 degrees of freedom
## Multiple R-squared:  0.5836, Adjusted R-squared:  0.5739 
## F-statistic: 60.33 on 116 and 4993 DF,  p-value: < 2.2e-16

Stepwise model optimization

Starting with the full model that incldued all 26 variables setpwize regression was used to optimize the model.

Part 4 - Inference

Optimized Model

Model stats - The adjuste \(R^2\) for the model is 0.5633 - The residual analysis analysis shows several outliers that are impacting the model

# Step:  AIC=23075.37
# cfpb_score ~ econ_hh_income + econ_savings + per_health + dem_age + 
#     fin_achieve_goals + fin_saving_habit + fin_consult_budget + 
#     fin_set_goals


# Model Specification
lm_model <- linear_reg() %>% 
            set_engine('lm') %>% 
            set_mode('regression')


# Fitting to Trained Data
lm_fit <- lm_model %>% 
          fit(cfpb_score ~ econ_hh_income + econ_savings + per_health + dem_age + 
                            fin_achieve_goals + fin_saving_habit + fin_consult_budget + 
                            fin_set_goals
              
              , data = cfpb_training)


# Explore Training Results
names(lm_fit)

## [1] "lvl"     "spec"    "fit"     "preproc" "elapsed"

summary(lm_fit$fit)

## 
## Call:
## stats::lm(formula = cfpb_score ~ econ_hh_income + econ_savings + 
##     per_health + dem_age + fin_achieve_goals + fin_saving_habit + 
##     fin_consult_budget + fin_set_goals, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.820  -5.814  -0.140   5.623  35.024 
## 
## Coefficients: (1 not defined because of singularities)
##                                               Estimate Std. Error t value
## (Intercept)                                   52.41177    3.83537  13.665
## econ_hh_income$20,000 to $29,999              -0.10330    0.62024  -0.167
## econ_hh_income$30,000 to $39,999               0.35286    0.59375   0.594
## econ_hh_income$40,000 to $49,999               1.88703    0.64083   2.945
## econ_hh_income$50,000 to $59,999               2.36279    0.63852   3.700
## econ_hh_income$60,000 to $74,999               2.22432    0.59614   3.731
## econ_hh_income$75,000 to $99,999               2.68264    0.55822   4.806
## econ_hh_income$100,000 to $149,999             3.30893    0.55404   5.972
## econ_hh_income$150,000 or more                 4.76658    0.59455   8.017
## econ_savings0                                 -3.45290    1.91836  -1.800
## econ_savings$1-99                             -4.75857    1.91987  -2.479
## econ_savings$100-999                          -2.77350    1.87634  -1.478
## econ_savings$1,000-4,999                       0.39751    1.86456   0.213
## econ_savings$5,000-19,999                      3.06818    1.86094   1.649
## econ_savings$20,000-74,999                     5.36737    1.86967   2.871
## econ_savings$75,000 or more                    7.81470    1.87755   4.162
## econ_savingsI don't know                       0.95326    2.00721   0.475
## econ_savingsPrefer not to say                  3.07322    1.85501   1.657
## per_healthPoor                                -5.07145    1.76005  -2.881
## per_healthFair                                -4.60820    1.56562  -2.943
## per_healthGood                                -2.44365    1.53806  -1.589
## per_healthVery good                           -0.82525    1.53449  -0.538
## per_healthExcellent                            0.01169    1.56971   0.007
## dem_age25-34                                   0.11157    0.59899   0.186
## dem_age35-44                                   0.55183    0.63008   0.876
## dem_age45-54                                   0.48912    0.60863   0.804
## dem_age55-61                                   2.18595    0.65510   3.337
## dem_age62-69                                   6.34868    0.62475  10.162
## dem_age70-74                                   7.61390    0.70862  10.745
## dem_age75+                                     8.46124    0.66936  12.641
## fin_achieve_goalsNot at all confident        -12.03539    2.38553  -5.045
## fin_achieve_goalsNot very confident           -6.70458    2.20614  -3.039
## fin_achieve_goalsSomewhat confident           -1.60568    2.18191  -0.736
## fin_achieve_goalsVery confident                4.65045    2.19378   2.120
## fin_saving_habitStrongly disagree             -4.62625    3.62588  -1.276
## fin_saving_habitDisagree                      -1.71659    3.58829  -0.478
## fin_saving_habitDisagree slightly             -0.45937    3.58453  -0.128
## fin_saving_habitAgree slightly                 0.66615    3.57125   0.187
## fin_saving_habitAgree                          1.74502    3.57127   0.489
## fin_saving_habitStrongly agree                 3.97200    3.57159   1.112
## fin_consult_budgetStrongly disagree            3.08194    5.49588   0.561
## fin_consult_budgetDisagree                     0.83319    5.44058   0.153
## fin_consult_budgetNeither agree nor disagree  -0.68170    5.42771  -0.126
## fin_consult_budgetAgree                       -1.64487    5.43326  -0.303
## fin_consult_budgetStrongly agree              -2.41061    5.42351  -0.444
## fin_set_goalsStrongly disagree                -3.56790    1.19038  -2.997
## fin_set_goalsDisagree                         -1.61324    0.66416  -2.429
## fin_set_goalsNeither agree nor disagree       -1.57624    0.51887  -3.038
## fin_set_goalsAgree                            -2.49662    0.44488  -5.612
## fin_set_goalsStrongly agree                         NA         NA      NA
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## econ_hh_income$20,000 to $29,999             0.867726    
## econ_hh_income$30,000 to $39,999             0.552346    
## econ_hh_income$40,000 to $49,999             0.003248 ** 
## econ_hh_income$50,000 to $59,999             0.000218 ***
## econ_hh_income$60,000 to $74,999             0.000193 ***
## econ_hh_income$75,000 to $99,999             1.59e-06 ***
## econ_hh_income$100,000 to $149,999           2.50e-09 ***
## econ_hh_income$150,000 or more               1.33e-15 ***
## econ_savings0                                0.071932 .  
## econ_savings$1-99                            0.013223 *  
## econ_savings$100-999                         0.139432    
## econ_savings$1,000-4,999                     0.831185    
## econ_savings$5,000-19,999                    0.099265 .  
## econ_savings$20,000-74,999                   0.004112 ** 
## econ_savings$75,000 or more                  3.20e-05 ***
## econ_savingsI don't know                     0.634867    
## econ_savingsPrefer not to say                0.097640 .  
## per_healthPoor                               0.003975 ** 
## per_healthFair                               0.003261 ** 
## per_healthGood                               0.112172    
## per_healthVery good                          0.590738    
## per_healthExcellent                          0.994056    
## dem_age25-34                                 0.852240    
## dem_age35-44                                 0.381176    
## dem_age45-54                                 0.421643    
## dem_age55-61                                 0.000854 ***
## dem_age62-69                                  < 2e-16 ***
## dem_age70-74                                  < 2e-16 ***
## dem_age75+                                    < 2e-16 ***
## fin_achieve_goalsNot at all confident        4.69e-07 ***
## fin_achieve_goalsNot very confident          0.002385 ** 
## fin_achieve_goalsSomewhat confident          0.461820    
## fin_achieve_goalsVery confident              0.034068 *  
## fin_saving_habitStrongly disagree            0.202051    
## fin_saving_habitDisagree                     0.632397    
## fin_saving_habitDisagree slightly            0.898031    
## fin_saving_habitAgree slightly               0.852035    
## fin_saving_habitAgree                        0.625127    
## fin_saving_habitStrongly agree               0.266143    
## fin_consult_budgetStrongly disagree          0.574977    
## fin_consult_budgetDisagree                   0.878291    
## fin_consult_budgetNeither agree nor disagree 0.900057    
## fin_consult_budgetAgree                      0.762100    
## fin_consult_budgetStrongly agree             0.656719    
## fin_set_goalsStrongly disagree               0.002737 ** 
## fin_set_goalsDisagree                        0.015176 *  
## fin_set_goalsNeither agree nor disagree      0.002395 ** 
## fin_set_goalsAgree                           2.11e-08 ***
## fin_set_goalsStrongly agree                        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.223 on 5061 degrees of freedom
## Multiple R-squared:  0.5675, Adjusted R-squared:  0.5633 
## F-statistic: 138.3 on 48 and 5061 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2)) 
plot(lm_fit$fit, pch = 16, col = '#006EA1')

# Data frame of estimated coefficients
tidy(lm_fit)

## # A tibble: 50 × 5
##    term                               estimate std.error statistic  p.value
##    <chr>                                 <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                          52.4       3.84     13.7   8.94e-42
##  2 econ_hh_income$20,000 to $29,999     -0.103     0.620    -0.167 8.68e- 1
##  3 econ_hh_income$30,000 to $39,999      0.353     0.594     0.594 5.52e- 1
##  4 econ_hh_income$40,000 to $49,999      1.89      0.641     2.94  3.25e- 3
##  5 econ_hh_income$50,000 to $59,999      2.36      0.639     3.70  2.18e- 4
##  6 econ_hh_income$60,000 to $74,999      2.22      0.596     3.73  1.93e- 4
##  7 econ_hh_income$75,000 to $99,999      2.68      0.558     4.81  1.59e- 6
##  8 econ_hh_income$100,000 to $149,999    3.31      0.554     5.97  2.50e- 9
##  9 econ_hh_income$150,000 or more        4.77      0.595     8.02  1.33e-15
## 10 econ_savings0                        -3.45      1.92     -1.80  7.19e- 2
## # … with 40 more rows

# Performance metrics on training data
glance(lm_fit)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.567         0.563  9.22      138.       0    48 -18579. 37259. 37585.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# variable importance
vip(lm_fit, num_features=15)

Testing Optimized Model

#predict(lm_fit, new_data = cfpb_test)

cfpb_results <- predict(lm_fit, new_data = cfpb_test) %>% 
                            bind_cols(cfpb_test)

## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response"): prediction from a rank-deficient fit may be misleading

#cfpb_results

# RMSE on test set
rmse(cfpb_results, truth = cfpb_score, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        9.61

rsq(cfpb_results, truth = cfpb_score, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.561

ggplot(data = cfpb_results,
       mapping = aes(x = .pred, y = cfpb_score)) +
  geom_point(color = '#006EA1') +
  geom_smooth(method = lm) +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression Results - Optimized',
       x = 'Predicted',
       y = 'Actual')

## `geom_smooth()` using formula 'y ~ x'

# econ_hh_income
ggplot(data = cfpb_results) +
  geom_point(mapping = aes(x=cfpb_score, y=econ_hh_income, color=econ_hh_income)) +
  labs (title = "econ_hh_income")

# econ_savings
ggplot(data = cfpb_results) +
  geom_point(mapping = aes(x=cfpb_score, y=econ_savings, color=econ_savings)) +
  labs (title = "econ_savings")

# per_health
ggplot(data = cfpb_results) +
  geom_point(mapping = aes(x=cfpb_score, y=per_health, color=per_health)) +
  labs (title = "per_health")

# dem_age
ggplot(data = cfpb_results) +
  geom_point(mapping = aes(x=cfpb_score, y=dem_age, color=dem_age)) +
  labs (title = "dem_age")

# fin_achieve_goals
ggplot(data = cfpb_results) +
  geom_point(mapping = aes(x=cfpb_score, y=fin_achieve_goals, color=fin_achieve_goals)) +
  labs (title = "fin_achieve_goals")

# fin_saving_habit
ggplot(data = cfpb_results) +
  geom_point(mapping = aes(x=cfpb_score, y=fin_saving_habit, color=fin_saving_habit)) +
  labs (title = "fin_saving_habit")

# fin_consult_budget
ggplot(data = cfpb_results) +
  geom_point(mapping = aes(x=cfpb_score, y=fin_consult_budget, color=fin_consult_budget)) +
  labs (title = "fin_consult_budget")

# fin_set_goals
ggplot(data = cfpb_results) +
  geom_point(mapping = aes(x=cfpb_score, y=fin_set_goals, color=fin_set_goals)) +
  labs (title = "fin_set_goals")

# cfpb

cfpb_results$.residual <- cfpb_results$cfpb_score - cfpb_results$.pred


ggplot(cfpb_results,
    aes(.residual, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(.residual)), linetype = "dashed", color="red", size = 0.5) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Residual Histogram")

summary(cfpb_df$cfpb_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   48.00   56.00   56.08   65.00   95.00

cfpb_results %>%
#    filter(cfpb_score > 47 & cfpb_score < 66) %>%
    ggplot(
        aes(y = abs(.residual), x = .pred)) +
        geom_point(alpha = 0.7, color="blue",size = 0.1) +
        geom_smooth(method = lm) + 
        #geom_vline(aes(xintercept = mean(.residual)), linetype = "dashed", color="red", size = 0.5) +
        scale_fill_brewer(palette="Spectral") +
        labs(title = "CFPB Residual (vs predictions)")

## `geom_smooth()` using formula 'y ~ x'

summary(cfpb_df$cfpb_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   48.00   56.00   56.08   65.00   95.00

# econ_hh_income
ggplot(data = cfpb_results) +
  geom_boxplot(mapping = aes(x=.residual, y=econ_hh_income, color=econ_hh_income)) +
  labs (title = "econ_hh_income")

# econ_savings
ggplot(data = cfpb_results) +
  geom_boxplot(mapping = aes(x=.residual, y=econ_savings, color=econ_savings)) +
  labs (title = "econ_savings")

# per_health
ggplot(data = cfpb_results) +
  geom_boxplot(mapping = aes(x=.residual, y=per_health, color=per_health)) +
  labs (title = "per_health")

# dem_age
ggplot(data = cfpb_results) +
  geom_boxplot(mapping = aes(x=.residual, y=dem_age, color=dem_age)) +
  labs (title = "dem_age")

# fin_achieve_goals
ggplot(data = cfpb_results) +
  geom_boxplot(mapping = aes(x=.residual, y=fin_achieve_goals, color=fin_achieve_goals)) +
  labs (title = "fin_achieve_goals")

# fin_saving_habit
ggplot(data = cfpb_results) +
  geom_boxplot(mapping = aes(x=.residual, y=fin_saving_habit, color=fin_saving_habit)) +
  labs (title = "fin_saving_habit")

# fin_consult_budget
ggplot(data = cfpb_results) +
  geom_boxplot(mapping = aes(x=.residual, y=fin_consult_budget, color=fin_consult_budget)) +
  labs (title = "fin_consult_budget")

# fin_set_goals
ggplot(data = cfpb_results) +
  geom_boxplot(mapping = aes(x=.residual, y=fin_set_goals, color=fin_set_goals)) +
  labs (title = "fin_set_goals")

Demographic Model

The demographic model that includes race, gender and education has limited predictive power for overall cfpb score on its own It would be interesting to review the impacts of race, gender and education if we held other variables constant.

Model stats - r-squared for the model is 0.08288 - The residual analysis analysis shows several outliers that are impacting the model

# Step:  AIC=22814.15
# cfpb_score ~  dem_edu + dem_race +dem_gender   


# Model Specification
lm_model <- linear_reg() %>% 
            set_engine('lm') %>% 
            set_mode('regression')


# Fitting to Trained Data
lm_fit <- lm_model %>% 
          fit(cfpb_score ~ dem_edu + dem_race + dem_gender 
              
              , data = cfpb_training)


# Explore Training Results
names(lm_fit)

## [1] "lvl"     "spec"    "fit"     "preproc" "elapsed"

summary(lm_fit$fit)

## 
## Call:
## stats::lm(formula = cfpb_score ~ dem_edu + dem_race + dem_gender, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.443  -8.084  -0.120   8.121  43.702 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          50.3671     0.7956  63.310  < 2e-16 ***
## dem_eduHigh school degree/GED         4.7528     0.8220   5.782 7.81e-09 ***
## dem_eduSome college/Associate         5.5119     0.8098   6.807 1.11e-11 ***
## dem_eduBachelor's degree              9.7164     0.8496  11.436  < 2e-16 ***
## dem_eduGraduate/professional degree  12.6420     0.8722  14.495  < 2e-16 ***
## dem_raceBlack, Non-Hispanic          -3.4910     0.6202  -5.629 1.91e-08 ***
## dem_raceOther, Non-Hispanic          -3.4305     0.8265  -4.151 3.37e-05 ***
## dem_raceHispanic                     -3.1819     0.5671  -5.611 2.12e-08 ***
## dem_genderFemale                     -0.6401     0.3790  -1.689   0.0913 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.37 on 5101 degrees of freedom
## Multiple R-squared:  0.08432,    Adjusted R-squared:  0.08288 
## F-statistic: 58.72 on 8 and 5101 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2)) 
plot(lm_fit$fit, pch = 16, col = '#006EA1')

# Data frame of estimated coefficients
tidy(lm_fit)

## # A tibble: 9 × 5
##   term                                estimate std.error statistic  p.value
##   <chr>                                  <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                           50.4       0.796     63.3  0       
## 2 dem_eduHigh school degree/GED          4.75      0.822      5.78 7.81e- 9
## 3 dem_eduSome college/Associate          5.51      0.810      6.81 1.11e-11
## 4 dem_eduBachelor's degree               9.72      0.850     11.4  6.36e-30
## 5 dem_eduGraduate/professional degree   12.6       0.872     14.5  1.09e-46
## 6 dem_raceBlack, Non-Hispanic           -3.49      0.620     -5.63 1.91e- 8
## 7 dem_raceOther, Non-Hispanic           -3.43      0.826     -4.15 3.37e- 5
## 8 dem_raceHispanic                      -3.18      0.567     -5.61 2.12e- 8
## 9 dem_genderFemale                      -0.640     0.379     -1.69 9.13e- 2

# Performance metrics on training data
glance(lm_fit)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1    0.0843        0.0829  13.4      58.7 4.49e-92     8 -20495. 41011. 41076.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# variable importance
vip(lm_fit, num_features=15)

Testing Demographic Model

#predict(lm_fit, new_data = cfpb_test)

cfpb_results <- predict(lm_fit, new_data = cfpb_test) %>% 
                            bind_cols(cfpb_test)

#cfpb_results

# RMSE on test set
rmse(cfpb_results, truth = cfpb_score, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        13.9

rsq(cfpb_results, truth = cfpb_score, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard      0.0749

ggplot(data = cfpb_results,
       mapping = aes(x = .pred, y = cfpb_score)) +
  geom_point(color = '#006EA1') +
  geom_smooth(method = lm) +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression Results - Demographic',
       x = 'Predicted',
       y = 'Actual')

## `geom_smooth()` using formula 'y ~ x'

Financial Planning

The model that focused on financial planning, budgeting and setting goals has some explanatory power. Although 0.2717 is relativily low in comparison to the other models it is significant when considered in the context of the variable that are not included such as income, education and age.

Model stats - r-squared for the model is 0.2717 - The residual analysis analysis shows several outliers that are impacting the model

# Step:  AIC=22814.15
# cfpb_score ~  fin_goals + fin_saving_habit + fin_frugal + 
#                       fin_consult_budget + fin_follow_budget + fin_set_goals + fin_clear_plan   


# Model Specification
lm_model <- linear_reg() %>% 
            set_engine('lm') %>% 
            set_mode('regression')


# Fitting to Trained Data
lm_fit <- lm_model %>% 
          fit(cfpb_score ~ fin_goals + fin_saving_habit + fin_frugal + fin_consult_budget + 
                  fin_follow_budget + fin_set_goals + fin_clear_plan  
              
              , data = cfpb_training)


# Explore Training Results
names(lm_fit)

## [1] "lvl"     "spec"    "fit"     "preproc" "elapsed"

summary(lm_fit$fit)

## 
## Call:
## stats::lm(formula = cfpb_score ~ fin_goals + fin_saving_habit + 
##     fin_frugal + fin_consult_budget + fin_follow_budget + fin_set_goals + 
##     fin_clear_plan, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.224  -7.087  -0.196   7.042  54.220 
## 
## Coefficients: (3 not defined because of singularities)
##                                              Estimate Std. Error t value
## (Intercept)                                   52.2624     4.9314  10.598
## fin_goalsNo                                    2.1419     1.6357   1.309
## fin_goalsYes                                   0.1991     1.6296   0.122
## fin_saving_habitStrongly disagree             -7.6011     5.3079  -1.432
## fin_saving_habitDisagree                      -2.0373     5.2728  -0.386
## fin_saving_habitDisagree slightly              0.7912     5.2671   0.150
## fin_saving_habitAgree slightly                 5.1534     5.2539   0.981
## fin_saving_habitAgree                         10.3637     5.2577   1.971
## fin_saving_habitStrongly agree                15.1631     5.2560   2.885
## fin_frugalStrongly disagree                  -17.2394     6.8211  -2.527
## fin_frugalDisagree                           -12.8933     6.4805  -1.990
## fin_frugalDisagree slightly                  -12.2969     6.2551  -1.966
## fin_frugalAgree slightly                     -13.0762     6.2088  -2.106
## fin_frugalAgree                              -11.8171     6.2016  -1.906
## fin_frugalStrongly agree                     -13.3798     6.1965  -2.159
## fin_consult_budgetStrongly disagree           21.1514     7.6840   2.753
## fin_consult_budgetDisagree                    16.6486     7.5975   2.191
## fin_consult_budgetNeither agree nor disagree  14.2768     7.5813   1.883
## fin_consult_budgetAgree                       12.7164     7.5853   1.676
## fin_consult_budgetStrongly agree              10.3244     7.5639   1.365
## fin_follow_budgetStrongly disagree            -2.7322     1.8653  -1.465
## fin_follow_budgetDisagree                      0.9328     0.9581   0.974
## fin_follow_budgetNeither agree nor disagree    0.6871     0.7849   0.875
## fin_follow_budgetAgree                         0.3698     0.6901   0.536
## fin_follow_budgetStrongly agree                    NA         NA      NA
## fin_set_goalsStrongly disagree                -6.3217     1.7734  -3.565
## fin_set_goalsDisagree                         -5.7318     0.9861  -5.813
## fin_set_goalsNeither agree nor disagree       -4.5447     0.7716  -5.890
## fin_set_goalsAgree                            -3.1480     0.6707  -4.694
## fin_set_goalsStrongly agree                        NA         NA      NA
## fin_clear_planStrongly disagree               -4.7413     1.3481  -3.517
## fin_clear_planDisagree                        -2.9043     0.9594  -3.027
## fin_clear_planNeither agree nor disagree      -2.8367     0.8794  -3.226
## fin_clear_planAgree                           -1.7227     0.8266  -2.084
## fin_clear_planStrongly agree                       NA         NA      NA
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## fin_goalsNo                                  0.190453    
## fin_goalsYes                                 0.902768    
## fin_saving_habitStrongly disagree            0.152193    
## fin_saving_habitDisagree                     0.699235    
## fin_saving_habitDisagree slightly            0.880598    
## fin_saving_habitAgree slightly               0.326701    
## fin_saving_habitAgree                        0.048764 *  
## fin_saving_habitStrongly agree               0.003932 ** 
## fin_frugalStrongly disagree                  0.011523 *  
## fin_frugalDisagree                           0.046694 *  
## fin_frugalDisagree slightly                  0.049363 *  
## fin_frugalAgree slightly                     0.035248 *  
## fin_frugalAgree                              0.056770 .  
## fin_frugalStrongly agree                     0.030878 *  
## fin_consult_budgetStrongly disagree          0.005932 ** 
## fin_consult_budgetDisagree                   0.028473 *  
## fin_consult_budgetNeither agree nor disagree 0.059735 .  
## fin_consult_budgetAgree                      0.093713 .  
## fin_consult_budgetStrongly agree             0.172326    
## fin_follow_budgetStrongly disagree           0.143057    
## fin_follow_budgetDisagree                    0.330316    
## fin_follow_budgetNeither agree nor disagree  0.381369    
## fin_follow_budgetAgree                       0.592045    
## fin_follow_budgetStrongly agree                    NA    
## fin_set_goalsStrongly disagree               0.000368 ***
## fin_set_goalsDisagree                        6.52e-09 ***
## fin_set_goalsNeither agree nor disagree      4.11e-09 ***
## fin_set_goalsAgree                           2.75e-06 ***
## fin_set_goalsStrongly agree                        NA    
## fin_clear_planStrongly disagree              0.000440 ***
## fin_clear_planDisagree                       0.002480 ** 
## fin_clear_planNeither agree nor disagree     0.001265 ** 
## fin_clear_planAgree                          0.037212 *  
## fin_clear_planStrongly agree                       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.91 on 5078 degrees of freedom
## Multiple R-squared:  0.2761, Adjusted R-squared:  0.2717 
## F-statistic: 62.47 on 31 and 5078 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2)) 
plot(lm_fit$fit, pch = 16, col = '#006EA1')

# Data frame of estimated coefficients
tidy(lm_fit)

## # A tibble: 35 × 5
##    term                              estimate std.error statistic  p.value
##    <chr>                                <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                         52.3        4.93    10.6   5.69e-26
##  2 fin_goalsNo                          2.14       1.64     1.31  1.90e- 1
##  3 fin_goalsYes                         0.199      1.63     0.122 9.03e- 1
##  4 fin_saving_habitStrongly disagree   -7.60       5.31    -1.43  1.52e- 1
##  5 fin_saving_habitDisagree            -2.04       5.27    -0.386 6.99e- 1
##  6 fin_saving_habitDisagree slightly    0.791      5.27     0.150 8.81e- 1
##  7 fin_saving_habitAgree slightly       5.15       5.25     0.981 3.27e- 1
##  8 fin_saving_habitAgree               10.4        5.26     1.97  4.88e- 2
##  9 fin_saving_habitStrongly agree      15.2        5.26     2.88  3.93e- 3
## 10 fin_frugalStrongly disagree        -17.2        6.82    -2.53  1.15e- 2
## # … with 25 more rows

# Performance metrics on training data
glance(lm_fit)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.276         0.272  11.9      62.5       0    31 -19895. 39856. 40072.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# variable importance
vip(lm_fit, num_features=15)

Testing Financial Planning Model

#predict(lm_fit, new_data = cfpb_test)

cfpb_results <- predict(lm_fit, new_data = cfpb_test) %>% 
                            bind_cols(cfpb_test)

## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response"): prediction from a rank-deficient fit may be misleading

#cfpb_results

# RMSE on test set
rmse(cfpb_results, truth = cfpb_score, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        12.4

rsq(cfpb_results, truth = cfpb_score, estimate = .pred)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.274

ggplot(data = cfpb_results,
       mapping = aes(x = .pred, y = cfpb_score)) +
  geom_point(color = '#006EA1') +
  geom_smooth(method = lm) +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression Results - CFPB Score',
       x = 'Predicted',
       y = 'Actual')

## `geom_smooth()` using formula 'y ~ x'

# age category

cfpb_results$cfpb_score_10cat <- cut(cfpb_results$cfpb_score, breaks = c(-10, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
                           labels = c("< 10","10-20","20-30","30-40","40-50","50-60","60-70","70-80","80-90","90-100"),
                           right = FALSE,
                           include.lowest=TRUE) 

t <- cfpb_results %>% select(cfpb_score,cfpb_score_10cat)

ggplot(cfpb_results,
    aes(.pred, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(.pred)), linetype = "dashed", color="red", size = 0.5) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (.pred)")

summary(cfpb_df$.pred)

## Length  Class   Mode 
##      0   NULL   NULL

ggplot(data = cfpb_results) +
  geom_boxplot(mapping = aes(x=.pred, y=cfpb_score_10cat, color=cfpb_score_10cat)) +
  labs (title = "cfpb_score_10cat")

ggplot(cfpb_results,
    aes(.pred, fill = cfpb_score_10cat, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(.pred)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by cfpb_score_10cat)")

Part 5 - Conclusion

A linear regression model was developed using the results of the Consumer Financial Protection Bureau’s (CFPB) National Financial Well-Being Survey Public Use File (PUF). The model predicts over 0.5608 cfpb score given the 8 variable selected for the optimized model.

Model Performance - adjusted \(R^2\) 0.5608655
- root mean square error 9.609951

Model Variables The following variable are included in the model as significant contributors to the overall model accuracy.

econ_hh_income (PPINCIMP) - The survey respondents Household Income for a given year.
econ_savings (SAVINGSRANGES) - An assessment of the amount of savings that the survey respondant currenlty has.
per_health (HEALTH) - An assessment of the survey respondents overall health.
dem_age (agecat) - The age of the survey respondent (8 categories).
fin_achieve_goals (GOALCONF) - An assessment of the ability of the survey respondent to achieve their financial goals.
fin_saving_habit (SAVEHABIT) - An assessment of savings habits and how often money is put away.
fin_consult_budget (PROPPLAN_1) - An assessment of how budgeting impacts spending decisions.
fin_set_goals (PROPPLAN_3) - An assessment of financial planning and goal setting discipline.

The findings for the analysis are not surprising and are consistent with several studies in the industry however deeper analysis is warranted. The cfpb scaled score is designed to cover the entire population however it is not surprising that the resulting model breaks down slightly at the higher and lower ranges of the cfpb score. A more localized analysis of cfpb score ranges might provide some additional insight into the factors that impact consumer financial wellness.

References

There are numerous studies of financial wellness sponsored by financial firms, industry groups and government agencies. I choose to focus on two studies:

Consumer Financial Protection Bureau - Financial Wellness Survey Financial well-being data The PUF survey results can be accessed as a csv file Financial well-being survey data

Appendix (optional)

CFPB Financial Well-Being Scale Methodology

The CFPB Financial Well-Being Scale methodology uses the following 10 questions to calculate the cfpb score. There is some overlap between the questions used to calculate the cfpb score and the survey questions. The optimized models presented in this analysis did not include any of the variable that overlap with the scoring questions.

How well does this statement describe you or your situation?

I could handle a major unexpected expense
I am securing my financial future
Because of my money situation, I feel like I will never have the things I want in life*
I can enjoy life because of the way I’m managing my money
I am just getting by financially*
I am concerned that the money I have or will save won’t last*

How often does this statement apply to you?

Giving a gift for a wedding, birthday or other occasion would put a strain on my finances for the month*
I have money left over at the end of the month
I am behind with my finances*
My finances control my life*

The research question is clearly stated, can be answered by the data, and the context of the problem clearly explained.

Review of the variables

The PUF includes financial attitude that were not included as part of the analysis but provide a interesting window into the relationship of financial attitudes to consumer financial wellbeing.

result_df <- getResultsCFPBFile()
result_df <- result_df %>% filter( cfpb_score > 0)

# life_satisfied
ggplot(result_df, aes(x=life_satisfied)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "life_satisfied")

ggplot(data = result_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=life_satisfied, color=life_satisfied)) +
  labs (title = "life_satisfied")

ggplot(result_df,
    aes(cfpb_score, fill = life_satisfied, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by life_satisfied)")

# life_optimistic
ggplot(result_df, aes(x=life_optimistic)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "life_optimistic")

ggplot(data = result_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=life_optimistic, color=life_optimistic)) +
  labs (title = "life_optimistic")

ggplot(result_df,
    aes(cfpb_score, fill = life_optimistic, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by life_optimistic)")

# material_admire_wealth
ggplot(result_df, aes(x=material_admire_wealth)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "material_admire_wealth")

ggplot(data = result_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=material_admire_wealth, color=material_admire_wealth)) +
  labs (title = "material_admire_wealth")

ggplot(result_df,
    aes(cfpb_score, fill = material_admire_wealth, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by material_admire_wealth)")

# material_define_worth
ggplot(result_df, aes(x=material_define_worth)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "material_define_worth")

ggplot(data = result_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=material_define_worth, color=material_define_worth)) +
  labs (title = "material_define_worth")

ggplot(result_df,
    aes(cfpb_score, fill = material_define_worth, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by material_define_worth)")

# material_impress
ggplot(result_df, aes(x=material_impress)) +
  geom_bar(color="white", fill="black", alpha = 0.6, size = 0.1) +
  coord_flip() +
  labs(title = "material_impress")

ggplot(data = result_df) +
  geom_boxplot(mapping = aes(x=cfpb_score, y=material_impress, color=material_impress)) +
  labs (title = "material_impress")

ggplot(result_df,
    aes(cfpb_score, fill = material_impress, y = stat(density))) +
    geom_histogram(binwidth = 1, alpha = 0.7, bins = 100, color="white",size = 0.1) +
    geom_vline(aes(xintercept = mean(cfpb_score)), linetype = "dashed", size = 0.2) +
    scale_fill_brewer(palette="Spectral") +
    labs(title = "CFPB Score (by material_impress)")

DATA 606 Data Project: Causes of Financial Wellbeing

David Simbandumwe