Canada is currently experiencing record-breaking levels of household debt. Consumer spending is central to the Canadian economy and therefore to financial stability. However, with the household debt ratio reaching 163% there is a growing concern that households are overextended. Household debt reporting often mentions low interest rates and rising real estate prices as the main drivers. I want to build a model that could accurately predict household debt levels for different types of households across ontario, taking into account a variety of both quantitative and qualitative factors.
Before proceeding, Here are few definition for “household”. According Statistics Canada, a household “refers to a person or group of persons who occupy the same dwelling and do not have a usual place of residence elsewhere in Canada or abroad.”
Moreover, debt is defined as “An amount of money borrowed by one party from another”. More specific assumptions which will be detailed below.
Our data collection process started with researching Statscan and other online resources for survey results related to household finances. The dataset was not available directly through Statistics Canada’s online page, however, I was able to obtain the data via the University of Toronto online Database.
We found The ‘Survey of Financial Security 2005’, a comprehensive, Canada-wide survey that uses 5,276 households to represent the approximately 12.5 million household population of Canada at the time. Moreover, each household contained 86 predictor (“x”variables).
We stratified our survey data (focusing only on Ontario) which reduced our data points to 1390. Furthermore, we filtered our data sets based on the following conditions: * Total debt over $1 Million * Total debt of $0 * A primary mortgage of zero * Removing all the derived (subtotal) that influence the model.
financial_survey_converted = subset(financial_survey_raw, subset=(wdtotal < 1000000))
financial_survey_converted = subset(financial_survey_raw, subset=(wdtotal != 0))
financial_survey_converted = subset(financial_survey_raw, subset=(wdprmor != 0))
financial_survey_converted = subset(financial_survey_raw, subset=(region == 3))
financial_survey_converted <- subset(financial_survey_converted, select = c(-wastdept, -wastmuic, -wastbond , -waststck, -wastrest , -wastvhle , -wastonof , -wdstomor , -wdstcred , -wdstvhln , -wdstodbt ) )
financial_survey_converted <- sapply(financial_survey_converted, as.numeric)
financial_survey_converted = as.data.frame(financial_survey_converted)
We used descriptive statistics to analyze different trends and variation between Canada and Ontario. The following table shows these observations
| wdTotal | Canada | Ontario |
|---|---|---|
| Mean | 131,035.53 | 161,122.19 |
| Median | 102,087.5 | 131,500 |
| Mode | 100,000 | 100,000 |
| Standard Deviation | 107,635.42 | 120,164.53 |
| Standard Error | 2674.22 | 5484.74 |
Ontario dataset contains around 480 sample points (once the filtering complete). The mean debt level is 161,122.18 with a range from 2,300 to 950,000. The dataset is skewed towards the right with a value of 2.10. This indicates there are some extreme debt values on the right side contributing to skewness. Because there are extreme values to the right, medium debt (131,500) is less than the mean household debt(161,122). Mode debt value of 100,000 indicates there are quite a few number of household with debt level of 100,000 , which is less than both mean and medium.
Standard deviation value is 120,164 which indicate some variability in the household debt levels. The coefficient of variance is 0.74, which mean there is good amount variability within dataset. This variability could be attributed to multiple household related factors. When compared to the Canadian household debt level, Ontario debt level have tighter variance. This can attribute to similar household economics condition with a region.
With our cleaned up and simplified data set, we used regression analysis to construct three different model.
cors <- sapply(financial_survey_converted, cor, y=financial_survey_converted$wdtotal)
## Warning in FUN(X[[i]], ...): the standard deviation is zero
mask <- (rank(-abs(cors)) <= 10 )
best10.pred <- financial_survey_converted[, mask]
best10.pred <- subset(best10.pred, select = c(-wdtotal) )
summary(best10.pred)
## fmsz27 nbear27 mtinc27 ecfexhmr
## Min. :1.000 Min. :0.000 Min. : -19000 Min. : 0
## 1st Qu.:1.250 1st Qu.:1.000 1st Qu.: 22250 1st Qu.: 6500
## Median :2.000 Median :1.000 Median : 57500 Median : 11000
## Mean :2.555 Mean :1.428 Mean : 101369 Mean : 15082
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.: 105000 3rd Qu.: 18375
## Max. :5.000 Max. :3.000 Max. :1900000 Max. :145000
## ecfexvhr watotpg waprval wdprmor
## Min. : 0 Min. : 175 Min. : 0 Min. : 0
## 1st Qu.: 675 1st Qu.: 142556 1st Qu.: 0 1st Qu.: 0
## Median : 1600 Median : 418850 Median : 185000 Median : 0
## Mean : 2593 Mean : 1105443 Mean : 314452 Mean : 57146
## 3rd Qu.: 3100 3rd Qu.: 914500 3rd Qu.: 350000 3rd Qu.: 84375
## Max. :23000 Max. :49750000 Max. :3800000 Max. :1450000
## wdstloc
## Min. : 0
## 1st Qu.: 0
## Median : 0
## Mean : 11506
## 3rd Qu.: 4750
## Max. :470000
full.model.best10 <- lm (wdtotal ~ fmsz27 + nbear27 + mtinc27 + ecfexhmr + ecfexvhr + watotpt + waprval + wdprmor , data=financial_survey_converted)
reduced.model.best10<- step (full.model.best10, direction = "backward")
## Start: AIC=31510.99
## wdtotal ~ fmsz27 + nbear27 + mtinc27 + ecfexhmr + ecfexvhr +
## watotpt + waprval + wdprmor
##
## Df Sum of Sq RSS AIC
## - fmsz27 1 5.9267e+08 9.6112e+12 31509
## - mtinc27 1 1.5087e+09 9.6121e+12 31509
## <none> 9.6106e+12 31511
## - ecfexhmr 1 4.4474e+10 9.6551e+12 31515
## - waprval 1 4.5262e+10 9.6559e+12 31516
## - nbear27 1 6.5589e+10 9.6762e+12 31518
## - ecfexvhr 1 8.7809e+10 9.6984e+12 31522
## - watotpt 1 1.1999e+11 9.7306e+12 31526
## - wdprmor 1 1.2094e+13 2.1704e+13 32641
##
## Step: AIC=31509.08
## wdtotal ~ nbear27 + mtinc27 + ecfexhmr + ecfexvhr + watotpt +
## waprval + wdprmor
##
## Df Sum of Sq RSS AIC
## - mtinc27 1 1.5770e+09 9.6128e+12 31507
## <none> 9.6112e+12 31509
## - ecfexhmr 1 4.3970e+10 9.6552e+12 31513
## - waprval 1 4.4681e+10 9.6559e+12 31514
## - nbear27 1 8.2198e+10 9.6934e+12 31519
## - ecfexvhr 1 8.7314e+10 9.6985e+12 31520
## - watotpt 1 1.2362e+11 9.7348e+12 31525
## - wdprmor 1 1.2098e+13 2.1709e+13 32640
##
## Step: AIC=31507.31
## wdtotal ~ nbear27 + ecfexhmr + ecfexvhr + watotpt + waprval +
## wdprmor
##
## Df Sum of Sq RSS AIC
## <none> 9.6128e+12 31507
## - ecfexhmr 1 4.3286e+10 9.6561e+12 31512
## - waprval 1 4.3984e+10 9.6568e+12 31512
## - nbear27 1 8.0683e+10 9.6935e+12 31517
## - ecfexvhr 1 8.6001e+10 9.6988e+12 31518
## - watotpt 1 1.2423e+11 9.7370e+12 31523
## - wdprmor 1 1.2126e+13 2.1739e+13 32640
summary(reduced.model.best10)
##
## Call:
## lm(formula = wdtotal ~ nbear27 + ecfexhmr + ecfexvhr + watotpt +
## waprval + wdprmor, data = financial_survey_converted)
##
## Residuals:
## Min 1Q Median 3Q Max
## -343074 -26687 -9434 2507 1066851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.090e+03 4.340e+03 -1.403 0.160791
## nbear27 8.779e+03 2.577e+03 3.407 0.000676 ***
## ecfexhmr 5.405e-01 2.166e-01 2.496 0.012694 *
## ecfexvhr 2.675e+00 7.604e-01 3.518 0.000450 ***
## watotpt 4.249e-03 1.005e-03 4.228 2.52e-05 ***
## waprval 1.750e-02 6.955e-03 2.516 0.011997 *
## wdprmor 1.010e+00 2.419e-02 41.768 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83370 on 1383 degrees of freedom
## Multiple R-squared: 0.7439, Adjusted R-squared: 0.7428
## F-statistic: 669.7 on 6 and 1383 DF, p-value: < 2.2e-16
min.model.best10 <- lm (wdtotal ~ 1, data=financial_survey_converted)
forward.model.best10 <- step(min.model.best10, direction="forward", scope = ( ~ fmsz27 + nbear27 + mtinc27 + ecfexhmr + ecfexvhr + watotpt + waprval + wdprmor))
## Start: AIC=33388.97
## wdtotal ~ 1
##
## Df Sum of Sq RSS AIC
## + wdprmor 1 2.6849e+13 1.0693e+13 31645
## + ecfexhmr 1 1.4528e+13 2.3014e+13 32711
## + waprval 1 6.3028e+12 3.1239e+13 33136
## + ecfexvhr 1 4.4687e+12 3.3073e+13 33215
## + fmsz27 1 3.3620e+12 3.4179e+13 33261
## + nbear27 1 3.1830e+12 3.4358e+13 33268
## + mtinc27 1 2.8542e+12 3.4687e+13 33281
## + watotpt 1 2.2682e+12 3.5273e+13 33304
## <none> 3.7541e+13 33389
##
## Step: AIC=31645.27
## wdtotal ~ wdprmor
##
## Df Sum of Sq RSS AIC
## + watotpt 1 6.2819e+11 1.0064e+13 31563
## + waprval 1 6.1639e+11 1.0076e+13 31565
## + ecfexhmr 1 3.5457e+11 1.0338e+13 31600
## + mtinc27 1 3.5080e+11 1.0342e+13 31601
## + ecfexvhr 1 3.3932e+11 1.0353e+13 31602
## + nbear27 1 2.8265e+11 1.0410e+13 31610
## + fmsz27 1 1.5859e+11 1.0534e+13 31627
## <none> 1.0693e+13 31645
##
## Step: AIC=31563.11
## wdtotal ~ wdprmor + watotpt
##
## Df Sum of Sq RSS AIC
## + ecfexvhr 1 2.3143e+11 9.8329e+12 31533
## + nbear27 1 2.0451e+11 9.8598e+12 31537
## + ecfexhmr 1 1.4893e+11 9.9154e+12 31544
## + waprval 1 1.2614e+11 9.9382e+12 31548
## + fmsz27 1 1.1390e+11 9.9504e+12 31549
## + mtinc27 1 4.2946e+10 1.0021e+13 31559
## <none> 1.0064e+13 31563
##
## Step: AIC=31532.78
## wdtotal ~ wdprmor + watotpt + ecfexvhr
##
## Df Sum of Sq RSS AIC
## + nbear27 1 9.7644e+10 9.7353e+12 31521
## + ecfexhmr 1 9.2457e+10 9.7404e+12 31522
## + waprval 1 8.7396e+10 9.7455e+12 31522
## + fmsz27 1 4.0686e+10 9.7922e+12 31529
## + mtinc27 1 1.4981e+10 9.8179e+12 31533
## <none> 9.8329e+12 31533
##
## Step: AIC=31520.91
## wdtotal ~ wdprmor + watotpt + ecfexvhr + nbear27
##
## Df Sum of Sq RSS AIC
## + waprval 1 7.9199e+10 9.6561e+12 31512
## + ecfexhmr 1 7.8501e+10 9.6568e+12 31512
## <none> 9.7353e+12 31521
## + mtinc27 1 4.7585e+09 9.7305e+12 31522
## + fmsz27 1 1.5269e+09 9.7337e+12 31523
##
## Step: AIC=31511.55
## wdtotal ~ wdprmor + watotpt + ecfexvhr + nbear27 + waprval
##
## Df Sum of Sq RSS AIC
## + ecfexhmr 1 4.3286e+10 9.6128e+12 31507
## <none> 9.6561e+12 31512
## + mtinc27 1 8.9297e+08 9.6552e+12 31513
## + fmsz27 1 1.1113e+08 9.6559e+12 31514
##
## Step: AIC=31507.31
## wdtotal ~ wdprmor + watotpt + ecfexvhr + nbear27 + waprval +
## ecfexhmr
##
## Df Sum of Sq RSS AIC
## <none> 9.6128e+12 31507
## + mtinc27 1 1577020822 9.6112e+12 31509
## + fmsz27 1 660944195 9.6121e+12 31509
summary(forward.model.best10)
##
## Call:
## lm(formula = wdtotal ~ wdprmor + watotpt + ecfexvhr + nbear27 +
## waprval + ecfexhmr, data = financial_survey_converted)
##
## Residuals:
## Min 1Q Median 3Q Max
## -343074 -26687 -9434 2507 1066851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.090e+03 4.340e+03 -1.403 0.160791
## wdprmor 1.010e+00 2.419e-02 41.768 < 2e-16 ***
## watotpt 4.249e-03 1.005e-03 4.228 2.52e-05 ***
## ecfexvhr 2.675e+00 7.604e-01 3.518 0.000450 ***
## nbear27 8.779e+03 2.577e+03 3.407 0.000676 ***
## waprval 1.750e-02 6.955e-03 2.516 0.011997 *
## ecfexhmr 5.405e-01 2.166e-01 2.496 0.012694 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83370 on 1383 degrees of freedom
## Multiple R-squared: 0.7439, Adjusted R-squared: 0.7428
## F-statistic: 669.7 on 6 and 1383 DF, p-value: < 2.2e-16
After analyzing the data,we identified following independent variables out of the 85 parameters could be used to generate our model.Following variables were kept:
| Response Variable | Household Debt [wdtotal] |
|---|---|
| Quantitative Factor Variable | Family income after taxes [atinc27] |
| Quantitative Factor Variable | Number of family members [fmsz27] |
| Quantitative Factor Variable | Number of credit cards [dvfcrn] |
| Quantitative Factor Variable | Student debt [wdsloan] |
| Quantitative Factor Variable | Child related expense [ecfexchr] |
| Quantitative Factor Variable | Subtotal-credit card & instalment debt [wdstcred] |
| Quantitative Factor Variable | Assets (continue basis) [watotpt] |
| Quantitative Factor Variable | Mortgage on principle residence [wdprmor] |
| Quantitative Factor Variable | Mortgage on other residence [wdstomor] |
| Quantitative Factor Variable | Line of credit [wdstloc] |
| Quantitative Factor Variable | Net worth including pension [wnetwpg] |
| Quantitative Factor Variable | Age of major income earner [ecpage] |
| Qualitative Factor Variable | Number of earners in the family [nbear27] |
| Qualitative Factor Variable | Region [region] |
| Qualitative Factor Variable | Level of education [dvphlv2g] |
| Qualitative Factor Variable | Sex of major income earner [hcsex_r] |
Once we generated the model we found from the above only below variables were statistically significant.
reg_multi <- lm(wdtotal ~ wdprmor + nbear27 + wdsloan + wdstloc , data=financial_survey_converted)
summary(reg_multi)
##
## Call:
## lm(formula = wdtotal ~ wdprmor + nbear27 + wdsloan + wdstloc,
## data = financial_survey_converted)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155412 -17810 -10950 -4899 1071111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.899e+03 3.611e+03 1.357 0.17510
## wdprmor 1.091e+00 1.679e-02 64.967 < 2e-16 ***
## nbear27 6.051e+03 2.213e+03 2.735 0.00632 **
## wdsloan 4.978e-01 3.575e-01 1.392 0.16407
## wdstloc 1.125e+00 5.191e-02 21.674 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74710 on 1385 degrees of freedom
## Multiple R-squared: 0.7941, Adjusted R-squared: 0.7935
## F-statistic: 1335 on 4 and 1385 DF, p-value: < 2.2e-16
Comparing the both forward and backword step model using anova returns a p value of 0 , indicating that both models are same.
anova(reduced.model.best10, forward.model.best10)
## Analysis of Variance Table
##
## Model 1: wdtotal ~ nbear27 + ecfexhmr + ecfexvhr + watotpt + waprval +
## wdprmor
## Model 2: wdtotal ~ wdprmor + watotpt + ecfexvhr + nbear27 + waprval +
## ecfexhmr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1383 9.6128e+12
## 2 1383 9.6128e+12 0 0
Comparing Step model with custom model using anova retuns a p value less than 0.05, indicating that both models are different.
anova(reduced.model.best10, reg_multi)
## Analysis of Variance Table
##
## Model 1: wdtotal ~ nbear27 + ecfexhmr + ecfexvhr + watotpt + waprval +
## wdprmor
## Model 2: wdtotal ~ wdprmor + nbear27 + wdsloan + wdstloc
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1383 9.6128e+12
## 2 1385 7.7298e+12 -2 1.883e+12
Diagostic plots for the both the models are show below. Based on the plots we can say these are pretty good regression models.
plot(reduced.model.best10)
The points in the Residuals Vs Fitted Plot are radomly scattered with no pattern for both model. The points in the Normal Q-Q plot are more or less on the line, indicating that residuals follow a normal distribution. In both Scale-Location plot and Residual Vs Leverage plots, the points are in the a group with none too far from center.
plot(reg_multi)
Again based on the plots similar observation was found. The points in the Residuals Vs Fitted Plot are radomly scattered with no pattern for both model. The points in the Normal Q-Q plot are more or less on the line, indicating that residuals follow a normal distribution. In both Scale-Location plot and Residual Vs Leverage plots, the points are in the a group with none too far from center.
Our model addressess managerial and statistical question. For a Canadian household that fits the constraints of our model, we can predict with 95% confidence interval what their total debt is.
In this example we have taken a random customer profile from the raw file, it actual wdtotal value is 150500. Step model predict a value of 191129.7 where as Custom model predict 187595.8
customerProfile = as.data.frame(financial_survey_raw[80,])
customerProfile$wdtotal
## [1] 150500
predict(reduced.model.best10, subset(customerProfile, select = c(fmsz27 , nbear27 , mtinc27 , ecfexhmr , ecfexvhr , watotpt , waprval , wdprmor)))
## 80
## 192139.7
predict(reg_multi, subset(customerProfile, select = c(wdprmor , nbear27 , wdsloan , wdstloc )))
## 80
## 187595.8