Part 1 - Introduction:

Canada is currently experiencing record-breaking levels of household debt. Consumer spending is central to the Canadian economy and therefore to financial stability. However, with the household debt ratio reaching 163% there is a growing concern that households are overextended. Household debt reporting often mentions low interest rates and rising real estate prices as the main drivers. I want to build a model that could accurately predict household debt levels for different types of households across ontario, taking into account a variety of both quantitative and qualitative factors.

Part 2 - Data:

2.1 - Definations

Before proceeding, Here are few definition for “household”. According Statistics Canada, a household “refers to a person or group of persons who occupy the same dwelling and do not have a usual place of residence elsewhere in Canada or abroad.”
Moreover, debt is defined as “An amount of money borrowed by one party from another”. More specific assumptions which will be detailed below.

2.2 - Data Sources

Our data collection process started with researching Statscan and other online resources for survey results related to household finances. The dataset was not available directly through Statistics Canada’s online page, however, I was able to obtain the data via the University of Toronto online Database.

We found The ‘Survey of Financial Security 2005’, a comprehensive, Canada-wide survey that uses 5,276 households to represent the approximately 12.5 million household population of Canada at the time. Moreover, each household contained 86 predictor (“x”variables).

2.3 - Sampling method

We stratified our survey data (focusing only on Ontario) which reduced our data points to 1390. Furthermore, we filtered our data sets based on the following conditions: * Total debt over $1 Million * Total debt of $0 * A primary mortgage of zero * Removing all the derived (subtotal) that influence the model.

financial_survey_converted = subset(financial_survey_raw, subset=(wdtotal < 1000000))

financial_survey_converted = subset(financial_survey_raw, subset=(wdtotal != 0))
financial_survey_converted = subset(financial_survey_raw, subset=(wdprmor != 0))
financial_survey_converted = subset(financial_survey_raw, subset=(region == 3))
financial_survey_converted <- subset(financial_survey_converted, select = c(-wastdept, -wastmuic, -wastbond , -waststck, -wastrest , -wastvhle , -wastonof , -wdstomor , -wdstcred , -wdstvhln ,  -wdstodbt    ) )

financial_survey_converted <- sapply(financial_survey_converted, as.numeric)
financial_survey_converted = as.data.frame(financial_survey_converted)

Part 3 - Exploratory data analysis:

3.2 - Descriptive Statistics

We used descriptive statistics to analyze different trends and variation between Canada and Ontario. The following table shows these observations

wdTotal Canada Ontario
Mean 131,035.53 161,122.19
Median 102,087.5 131,500
Mode 100,000 100,000
Standard Deviation 107,635.42 120,164.53
Standard Error 2674.22 5484.74

Ontario dataset contains around 480 sample points (once the filtering complete). The mean debt level is 161,122.18 with a range from 2,300 to 950,000. The dataset is skewed towards the right with a value of 2.10. This indicates there are some extreme debt values on the right side contributing to skewness. Because there are extreme values to the right, medium debt (131,500) is less than the mean household debt(161,122). Mode debt value of 100,000 indicates there are quite a few number of household with debt level of 100,000 , which is less than both mean and medium.

Standard deviation value is 120,164 which indicate some variability in the household debt levels. The coefficient of variance is 0.74, which mean there is good amount variability within dataset. This variability could be attributed to multiple household related factors. When compared to the Canadian household debt level, Ontario debt level have tighter variance. This can attribute to similar household economics condition with a region.

3.2 - Linear Regression Analysis

With our cleaned up and simplified data set, we used regression analysis to construct three different model.

3.2.1 Select top 10 high correlation predictor.

cors <- sapply(financial_survey_converted, cor, y=financial_survey_converted$wdtotal)
## Warning in FUN(X[[i]], ...): the standard deviation is zero
mask <- (rank(-abs(cors)) <= 10 )
best10.pred <- financial_survey_converted[, mask]

best10.pred <- subset(best10.pred, select = c(-wdtotal) )
summary(best10.pred)
##      fmsz27         nbear27         mtinc27           ecfexhmr     
##  Min.   :1.000   Min.   :0.000   Min.   : -19000   Min.   :     0  
##  1st Qu.:1.250   1st Qu.:1.000   1st Qu.:  22250   1st Qu.:  6500  
##  Median :2.000   Median :1.000   Median :  57500   Median : 11000  
##  Mean   :2.555   Mean   :1.428   Mean   : 101369   Mean   : 15082  
##  3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.: 105000   3rd Qu.: 18375  
##  Max.   :5.000   Max.   :3.000   Max.   :1900000   Max.   :145000  
##     ecfexvhr        watotpg            waprval           wdprmor       
##  Min.   :    0   Min.   :     175   Min.   :      0   Min.   :      0  
##  1st Qu.:  675   1st Qu.:  142556   1st Qu.:      0   1st Qu.:      0  
##  Median : 1600   Median :  418850   Median : 185000   Median :      0  
##  Mean   : 2593   Mean   : 1105443   Mean   : 314452   Mean   :  57146  
##  3rd Qu.: 3100   3rd Qu.:  914500   3rd Qu.: 350000   3rd Qu.:  84375  
##  Max.   :23000   Max.   :49750000   Max.   :3800000   Max.   :1450000  
##     wdstloc      
##  Min.   :     0  
##  1st Qu.:     0  
##  Median :     0  
##  Mean   : 11506  
##  3rd Qu.:  4750  
##  Max.   :470000

3.2.2 Stepwise backword regression.

full.model.best10 <- lm (wdtotal ~  fmsz27 + nbear27 + mtinc27 + ecfexhmr + ecfexvhr + watotpt + waprval + wdprmor   , data=financial_survey_converted)

reduced.model.best10<- step (full.model.best10, direction = "backward")
## Start:  AIC=31510.99
## wdtotal ~ fmsz27 + nbear27 + mtinc27 + ecfexhmr + ecfexvhr + 
##     watotpt + waprval + wdprmor
## 
##            Df  Sum of Sq        RSS   AIC
## - fmsz27    1 5.9267e+08 9.6112e+12 31509
## - mtinc27   1 1.5087e+09 9.6121e+12 31509
## <none>                   9.6106e+12 31511
## - ecfexhmr  1 4.4474e+10 9.6551e+12 31515
## - waprval   1 4.5262e+10 9.6559e+12 31516
## - nbear27   1 6.5589e+10 9.6762e+12 31518
## - ecfexvhr  1 8.7809e+10 9.6984e+12 31522
## - watotpt   1 1.1999e+11 9.7306e+12 31526
## - wdprmor   1 1.2094e+13 2.1704e+13 32641
## 
## Step:  AIC=31509.08
## wdtotal ~ nbear27 + mtinc27 + ecfexhmr + ecfexvhr + watotpt + 
##     waprval + wdprmor
## 
##            Df  Sum of Sq        RSS   AIC
## - mtinc27   1 1.5770e+09 9.6128e+12 31507
## <none>                   9.6112e+12 31509
## - ecfexhmr  1 4.3970e+10 9.6552e+12 31513
## - waprval   1 4.4681e+10 9.6559e+12 31514
## - nbear27   1 8.2198e+10 9.6934e+12 31519
## - ecfexvhr  1 8.7314e+10 9.6985e+12 31520
## - watotpt   1 1.2362e+11 9.7348e+12 31525
## - wdprmor   1 1.2098e+13 2.1709e+13 32640
## 
## Step:  AIC=31507.31
## wdtotal ~ nbear27 + ecfexhmr + ecfexvhr + watotpt + waprval + 
##     wdprmor
## 
##            Df  Sum of Sq        RSS   AIC
## <none>                   9.6128e+12 31507
## - ecfexhmr  1 4.3286e+10 9.6561e+12 31512
## - waprval   1 4.3984e+10 9.6568e+12 31512
## - nbear27   1 8.0683e+10 9.6935e+12 31517
## - ecfexvhr  1 8.6001e+10 9.6988e+12 31518
## - watotpt   1 1.2423e+11 9.7370e+12 31523
## - wdprmor   1 1.2126e+13 2.1739e+13 32640
summary(reduced.model.best10)
## 
## Call:
## lm(formula = wdtotal ~ nbear27 + ecfexhmr + ecfexvhr + watotpt + 
##     waprval + wdprmor, data = financial_survey_converted)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -343074  -26687   -9434    2507 1066851 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.090e+03  4.340e+03  -1.403 0.160791    
## nbear27      8.779e+03  2.577e+03   3.407 0.000676 ***
## ecfexhmr     5.405e-01  2.166e-01   2.496 0.012694 *  
## ecfexvhr     2.675e+00  7.604e-01   3.518 0.000450 ***
## watotpt      4.249e-03  1.005e-03   4.228 2.52e-05 ***
## waprval      1.750e-02  6.955e-03   2.516 0.011997 *  
## wdprmor      1.010e+00  2.419e-02  41.768  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83370 on 1383 degrees of freedom
## Multiple R-squared:  0.7439, Adjusted R-squared:  0.7428 
## F-statistic: 669.7 on 6 and 1383 DF,  p-value: < 2.2e-16

3.2.3 Stepwise forward regression.

min.model.best10 <- lm (wdtotal ~ 1, data=financial_survey_converted)
forward.model.best10 <- step(min.model.best10, direction="forward", scope = ( ~ fmsz27 + nbear27 + mtinc27 + ecfexhmr + ecfexvhr + watotpt + waprval + wdprmor))
## Start:  AIC=33388.97
## wdtotal ~ 1
## 
##            Df  Sum of Sq        RSS   AIC
## + wdprmor   1 2.6849e+13 1.0693e+13 31645
## + ecfexhmr  1 1.4528e+13 2.3014e+13 32711
## + waprval   1 6.3028e+12 3.1239e+13 33136
## + ecfexvhr  1 4.4687e+12 3.3073e+13 33215
## + fmsz27    1 3.3620e+12 3.4179e+13 33261
## + nbear27   1 3.1830e+12 3.4358e+13 33268
## + mtinc27   1 2.8542e+12 3.4687e+13 33281
## + watotpt   1 2.2682e+12 3.5273e+13 33304
## <none>                   3.7541e+13 33389
## 
## Step:  AIC=31645.27
## wdtotal ~ wdprmor
## 
##            Df  Sum of Sq        RSS   AIC
## + watotpt   1 6.2819e+11 1.0064e+13 31563
## + waprval   1 6.1639e+11 1.0076e+13 31565
## + ecfexhmr  1 3.5457e+11 1.0338e+13 31600
## + mtinc27   1 3.5080e+11 1.0342e+13 31601
## + ecfexvhr  1 3.3932e+11 1.0353e+13 31602
## + nbear27   1 2.8265e+11 1.0410e+13 31610
## + fmsz27    1 1.5859e+11 1.0534e+13 31627
## <none>                   1.0693e+13 31645
## 
## Step:  AIC=31563.11
## wdtotal ~ wdprmor + watotpt
## 
##            Df  Sum of Sq        RSS   AIC
## + ecfexvhr  1 2.3143e+11 9.8329e+12 31533
## + nbear27   1 2.0451e+11 9.8598e+12 31537
## + ecfexhmr  1 1.4893e+11 9.9154e+12 31544
## + waprval   1 1.2614e+11 9.9382e+12 31548
## + fmsz27    1 1.1390e+11 9.9504e+12 31549
## + mtinc27   1 4.2946e+10 1.0021e+13 31559
## <none>                   1.0064e+13 31563
## 
## Step:  AIC=31532.78
## wdtotal ~ wdprmor + watotpt + ecfexvhr
## 
##            Df  Sum of Sq        RSS   AIC
## + nbear27   1 9.7644e+10 9.7353e+12 31521
## + ecfexhmr  1 9.2457e+10 9.7404e+12 31522
## + waprval   1 8.7396e+10 9.7455e+12 31522
## + fmsz27    1 4.0686e+10 9.7922e+12 31529
## + mtinc27   1 1.4981e+10 9.8179e+12 31533
## <none>                   9.8329e+12 31533
## 
## Step:  AIC=31520.91
## wdtotal ~ wdprmor + watotpt + ecfexvhr + nbear27
## 
##            Df  Sum of Sq        RSS   AIC
## + waprval   1 7.9199e+10 9.6561e+12 31512
## + ecfexhmr  1 7.8501e+10 9.6568e+12 31512
## <none>                   9.7353e+12 31521
## + mtinc27   1 4.7585e+09 9.7305e+12 31522
## + fmsz27    1 1.5269e+09 9.7337e+12 31523
## 
## Step:  AIC=31511.55
## wdtotal ~ wdprmor + watotpt + ecfexvhr + nbear27 + waprval
## 
##            Df  Sum of Sq        RSS   AIC
## + ecfexhmr  1 4.3286e+10 9.6128e+12 31507
## <none>                   9.6561e+12 31512
## + mtinc27   1 8.9297e+08 9.6552e+12 31513
## + fmsz27    1 1.1113e+08 9.6559e+12 31514
## 
## Step:  AIC=31507.31
## wdtotal ~ wdprmor + watotpt + ecfexvhr + nbear27 + waprval + 
##     ecfexhmr
## 
##           Df  Sum of Sq        RSS   AIC
## <none>                  9.6128e+12 31507
## + mtinc27  1 1577020822 9.6112e+12 31509
## + fmsz27   1  660944195 9.6121e+12 31509
summary(forward.model.best10)
## 
## Call:
## lm(formula = wdtotal ~ wdprmor + watotpt + ecfexvhr + nbear27 + 
##     waprval + ecfexhmr, data = financial_survey_converted)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -343074  -26687   -9434    2507 1066851 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.090e+03  4.340e+03  -1.403 0.160791    
## wdprmor      1.010e+00  2.419e-02  41.768  < 2e-16 ***
## watotpt      4.249e-03  1.005e-03   4.228 2.52e-05 ***
## ecfexvhr     2.675e+00  7.604e-01   3.518 0.000450 ***
## nbear27      8.779e+03  2.577e+03   3.407 0.000676 ***
## waprval      1.750e-02  6.955e-03   2.516 0.011997 *  
## ecfexhmr     5.405e-01  2.166e-01   2.496 0.012694 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83370 on 1383 degrees of freedom
## Multiple R-squared:  0.7439, Adjusted R-squared:  0.7428 
## F-statistic: 669.7 on 6 and 1383 DF,  p-value: < 2.2e-16

3.2.4 Selection based on model and understanding of the data.

After analyzing the data,we identified following independent variables out of the 85 parameters could be used to generate our model.Following variables were kept:

Response Variable Household Debt [wdtotal]
Quantitative Factor Variable Family income after taxes [atinc27]
Quantitative Factor Variable Number of family members [fmsz27]
Quantitative Factor Variable Number of credit cards [dvfcrn]
Quantitative Factor Variable Student debt [wdsloan]
Quantitative Factor Variable Child related expense [ecfexchr]
Quantitative Factor Variable Subtotal-credit card & instalment debt [wdstcred]
Quantitative Factor Variable Assets (continue basis) [watotpt]
Quantitative Factor Variable Mortgage on principle residence [wdprmor]
Quantitative Factor Variable Mortgage on other residence [wdstomor]
Quantitative Factor Variable Line of credit [wdstloc]
Quantitative Factor Variable Net worth including pension [wnetwpg]
Quantitative Factor Variable Age of major income earner [ecpage]
Qualitative Factor Variable Number of earners in the family [nbear27]
Qualitative Factor Variable Region [region]
Qualitative Factor Variable Level of education [dvphlv2g]
Qualitative Factor Variable Sex of major income earner [hcsex_r]

Once we generated the model we found from the above only below variables were statistically significant.

 reg_multi <- lm(wdtotal ~ wdprmor + nbear27 + wdsloan + wdstloc , data=financial_survey_converted)
summary(reg_multi)
## 
## Call:
## lm(formula = wdtotal ~ wdprmor + nbear27 + wdsloan + wdstloc, 
##     data = financial_survey_converted)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -155412  -17810  -10950   -4899 1071111 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.899e+03  3.611e+03   1.357  0.17510    
## wdprmor     1.091e+00  1.679e-02  64.967  < 2e-16 ***
## nbear27     6.051e+03  2.213e+03   2.735  0.00632 ** 
## wdsloan     4.978e-01  3.575e-01   1.392  0.16407    
## wdstloc     1.125e+00  5.191e-02  21.674  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74710 on 1385 degrees of freedom
## Multiple R-squared:  0.7941, Adjusted R-squared:  0.7935 
## F-statistic:  1335 on 4 and 1385 DF,  p-value: < 2.2e-16

Part4 Model comparision

4.1 Comparing Models by using ANOVA

Comparing the both forward and backword step model using anova returns a p value of 0 , indicating that both models are same.

anova(reduced.model.best10, forward.model.best10)
## Analysis of Variance Table
## 
## Model 1: wdtotal ~ nbear27 + ecfexhmr + ecfexvhr + watotpt + waprval + 
##     wdprmor
## Model 2: wdtotal ~ wdprmor + watotpt + ecfexvhr + nbear27 + waprval + 
##     ecfexhmr
##   Res.Df        RSS Df Sum of Sq F Pr(>F)
## 1   1383 9.6128e+12                      
## 2   1383 9.6128e+12  0         0

Comparing Step model with custom model using anova retuns a p value less than 0.05, indicating that both models are different.

anova(reduced.model.best10, reg_multi)
## Analysis of Variance Table
## 
## Model 1: wdtotal ~ nbear27 + ecfexhmr + ecfexvhr + watotpt + waprval + 
##     wdprmor
## Model 2: wdtotal ~ wdprmor + nbear27 + wdsloan + wdstloc
##   Res.Df        RSS Df Sum of Sq F Pr(>F)
## 1   1383 9.6128e+12                      
## 2   1385 7.7298e+12 -2 1.883e+12

4.2 Diagnosing different linear models

Diagostic plots for the both the models are show below. Based on the plots we can say these are pretty good regression models.

4.2.1 Step wise linear regression.

plot(reduced.model.best10)

The points in the Residuals Vs Fitted Plot are radomly scattered with no pattern for both model. The points in the Normal Q-Q plot are more or less on the line, indicating that residuals follow a normal distribution. In both Scale-Location plot and Residual Vs Leverage plots, the points are in the a group with none too far from center.

4.2.2 Custom Model linear regression.

plot(reg_multi)

Again based on the plots similar observation was found. The points in the Residuals Vs Fitted Plot are radomly scattered with no pattern for both model. The points in the Normal Q-Q plot are more or less on the line, indicating that residuals follow a normal distribution. In both Scale-Location plot and Residual Vs Leverage plots, the points are in the a group with none too far from center.

Part5 Application and Probability

Our model addressess managerial and statistical question. For a Canadian household that fits the constraints of our model, we can predict with 95% confidence interval what their total debt is.

In this example we have taken a random customer profile from the raw file, it actual wdtotal value is 150500. Step model predict a value of 191129.7 where as Custom model predict 187595.8

customerProfile = as.data.frame(financial_survey_raw[80,])
customerProfile$wdtotal
## [1] 150500
predict(reduced.model.best10, subset(customerProfile, select = c(fmsz27 , nbear27 , mtinc27 , ecfexhmr , ecfexvhr , watotpt , waprval , wdprmor)))
##       80 
## 192139.7
predict(reg_multi, subset(customerProfile, select = c(wdprmor  , nbear27 , wdsloan  , wdstloc  )))
##       80 
## 187595.8