C. Donovan
We've seen a number of ways of increasing the complexity of the model of signal
How do we decide which model is appropriate, without a strong a priori view?
My favourite equation
Many models are effectively regressions - they are broadly of the form:
\[ y = f(x_1,..., x_p) + noise \]
So we need to estimate the signal \( f \), which is usually contingent on an idea of the distribution of the error (because we have to tease the signal and noise apart)
As we'll see, our estimates for the signal are optimised in light of our model for noise
The term linear can be used differently in different contexts
\[ y = \mathbf{X}\boldsymbol{\beta} \]
\[ y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]
So we can now add terms for: numeric/factor covariates, their interactions, polynomials, etc
The range of potential models, even for a few covariates, is massive
If we don't know the model exactly, how do we choose between candidates?
An obvious way is to drop terms that are statistically insignificant
interactionModel <- lm(DEBTINC ~ REASON*VALUE, data = loanData)
summary(interactionModel)
Call:
lm(formula = DEBTINC ~ REASON * VALUE, data = loanData)
Residuals:
Min 1Q Median 3Q Max
-33.814 -4.654 0.890 4.987 168.448
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.233e+01 3.282e-01 98.482 < 2e-16 ***
REASONHomeImp -1.005e+00 5.194e-01 -1.935 0.0531 .
VALUE 1.914e-05 2.855e-06 6.704 2.28e-11 ***
REASONHomeImp:VALUE 2.994e-08 4.293e-06 0.007 0.9944
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.146 on 4472 degrees of freedom
(1484 observations deleted due to missingness)
Multiple R-squared: 0.02057, Adjusted R-squared: 0.01992
F-statistic: 31.31 on 3 and 4472 DF, p-value: < 2.2e-16
library(car)
Anova(interactionModel)
Anova Table (Type II tests)
Response: DEBTINC
Sum Sq Df F value Pr(>F)
REASON 951 1 14.328 0.0001556 ***
VALUE 5354 1 80.699 < 2.2e-16 ***
REASON:VALUE 0 1 0.000 0.9944352
Residuals 296720 4472
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So I'd drop the interaction term - it's arguably conflating noise with signal
interactionModel <- lm(DEBTINC ~ REASON + VALUE, data = loanData)
Anova(interactionModel)
Anova Table (Type II tests)
Response: DEBTINC
Sum Sq Df F value Pr(>F)
REASON 951 1 14.331 0.0001553 ***
VALUE 5354 1 80.717 < 2.2e-16 ***
Residuals 296720 4473
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is commonly done, but often not favoured:
We might consider fit measures to choose
One form is:
\[ R^2=1-\frac{(n-1)^{-1}\sum_{i=1}^n(y_i-\hat{y}_i)^2}{(n-1)^{-1}\sum_{i=1}^n(y_i-\bar{y})^2} = 1-\frac{{\mathrm SSE}/(n-1)}{{\mathrm SST}/(n-1)} \]
Can we fix it a bit?…
adjusted-\( R^2 \):
\[ 1-\frac{(n-p-1)^{-1}\sum_{i=1}^n(y_i-\hat{y}_i)^2}{(n-1)^{-1}\sum_{i=1}^n(y_i-\bar{y})^2} =1-\frac{SSE/(n-p-1)}{SST/(n-1)} \]
A penalized Likelihood
\[ AIC=-2\ell + 2p \]
Penalized Likelihood
\[ AIC=-2\ell + 2p \]
We're going to select simpler models. Often simply referred to as parsimony - which is deemed a good thing
Occam's razor: after William of Ockham Entities must not be multiplied beyond necessity (Non sunt multiplicanda entia sine necessitate) - although that's John Punch's (1639).
Use a brute-force algorithmic approach to search through all possible models - obviously a difficult combinatoric problem. Consider:
The same rationale can be applied but from a very basic starting model, increasing in complexity.
fullModel <- lm(DEBTINC ~ ., data=loanData)
summary(fullModel)
Call:
lm(formula = DEBTINC ~ ., data = loanData)
Residuals:
Min 1Q Median 3Q Max
-42.315 -4.251 0.659 4.574 102.252
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.869e+01 6.097e-01 47.057 < 2e-16 ***
BAD 5.870e+00 4.837e-01 12.138 < 2e-16 ***
LOAN 1.030e-04 1.338e-05 7.703 1.73e-14 ***
MORTDUE 4.203e-05 6.064e-06 6.930 5.03e-12 ***
VALUE -1.486e-05 5.130e-06 -2.898 0.00378 **
REASONHomeImp 7.390e-01 2.935e-01 2.518 0.01186 *
JOBOffice 5.675e-01 4.707e-01 1.206 0.22800
JOBOther 3.674e-01 4.106e-01 0.895 0.37097
JOBProfExe -2.044e+00 4.391e-01 -4.655 3.36e-06 ***
JOBSales 1.715e+00 1.087e+00 1.578 0.11468
JOBSelf 3.307e-01 8.585e-01 0.385 0.70012
YOJ -2.055e-02 1.781e-02 -1.154 0.24869
DEROG -5.995e-01 2.323e-01 -2.581 0.00991 **
DELINQ -1.578e-01 1.669e-01 -0.946 0.34440
CLAGE -3.905e-03 1.669e-03 -2.340 0.01933 *
NINQ 5.412e-01 8.607e-02 6.287 3.65e-10 ***
CLNO 8.406e-02 1.506e-02 5.580 2.59e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.371 on 3347 degrees of freedom
(2596 observations deleted due to missingness)
Multiple R-squared: 0.1447, Adjusted R-squared: 0.1406
F-statistic: 35.39 on 16 and 3347 DF, p-value: < 2.2e-16
Giving (note no interactions used)
Anova(fullModel)
Anova Table (Type II tests)
Response: DEBTINC
Sum Sq Df F value Pr(>F)
BAD 8005 1 147.3204 < 2.2e-16 ***
LOAN 3224 1 59.3422 1.734e-14 ***
MORTDUE 2609 1 48.0252 5.026e-12 ***
VALUE 456 1 8.3961 0.003785 **
REASON 344 1 6.3388 0.011859 *
JOB 3633 5 13.3719 6.227e-13 ***
YOJ 72 1 1.3311 0.248688
DEROG 362 1 6.6593 0.009906 **
DELINQ 49 1 0.8942 0.344401
CLAGE 298 1 5.4770 0.019326 *
NINQ 2148 1 39.5296 3.649e-10 ***
CLNO 1692 1 31.1378 2.595e-08 ***
Residuals 181856 3347
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Delete an insignificant term
newModel <- update(fullModel, . ~ . -DELINQ)
Anova(newModel)
Anova Table (Type II tests)
Response: DEBTINC
Sum Sq Df F value Pr(>F)
BAD 9819 1 158.9785 < 2.2e-16 ***
LOAN 3315 1 53.6698 2.955e-13 ***
MORTDUE 2726 1 44.1314 3.568e-11 ***
VALUE 421 1 6.8232 0.009038 **
REASON 318 1 5.1500 0.023310 *
JOB 4715 5 15.2702 7.211e-15 ***
YOJ 87 1 1.4024 0.236405
DEROG 572 1 9.2600 0.002360 **
CLAGE 370 1 5.9915 0.014426 *
NINQ 2042 1 33.0704 9.683e-09 ***
CLNO 1534 1 24.8370 6.554e-07 ***
Residuals 207514 3360
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
loanData_NONA <- na.omit(loanData)
fullModel <- lm(DEBTINC ~ ., data=loanData_NONA)
step(fullModel)
Start: AIC=13456.65
DEBTINC ~ BAD + LOAN + MORTDUE + VALUE + REASON + JOB + YOJ +
DEROG + DELINQ + CLAGE + NINQ + CLNO
Df Sum of Sq RSS AIC
- DELINQ 1 48.6 181905 13456
- YOJ 1 72.3 181928 13456
<none> 181856 13457
- CLAGE 1 297.6 182154 13460
- REASON 1 344.4 182200 13461
- DEROG 1 361.8 182218 13461
- VALUE 1 456.2 182312 13463
- CLNO 1 1691.8 183548 13486
- NINQ 1 2147.8 184004 13494
- MORTDUE 1 2609.4 184465 13503
- JOB 5 3632.7 185489 13513
- LOAN 1 3224.3 185080 13514
- BAD 1 8004.5 189861 13600
Step: AIC=13455.54
DEBTINC ~ BAD + LOAN + MORTDUE + VALUE + REASON + JOB + YOJ +
DEROG + CLAGE + NINQ + CLNO
Df Sum of Sq RSS AIC
- YOJ 1 76.6 181981 13455
<none> 181905 13456
- CLAGE 1 304.2 182209 13459
- REASON 1 339.9 182244 13460
- DEROG 1 388.0 182293 13461
- VALUE 1 442.1 182347 13462
- CLNO 1 1646.5 183551 13484
- NINQ 1 2186.0 184091 13494
- MORTDUE 1 2598.1 184503 13501
- JOB 5 3644.3 185549 13512
- LOAN 1 3252.0 185157 13513
- BAD 1 8247.7 190152 13603
Step: AIC=13454.96
DEBTINC ~ BAD + LOAN + MORTDUE + VALUE + REASON + JOB + DEROG +
CLAGE + NINQ + CLNO
Df Sum of Sq RSS AIC
<none> 181981 13455
- REASON 1 311.4 182293 13459
- DEROG 1 373.7 182355 13460
- CLAGE 1 396.9 182378 13460
- VALUE 1 478.2 182459 13462
- CLNO 1 1624.6 183606 13483
- NINQ 1 2193.2 184175 13493
- MORTDUE 1 2794.7 184776 13504
- JOB 5 3619.0 185600 13511
- LOAN 1 3183.8 185165 13511
- BAD 1 8288.4 190270 13603
Call:
lm(formula = DEBTINC ~ BAD + LOAN + MORTDUE + VALUE + REASON +
JOB + DEROG + CLAGE + NINQ + CLNO, data = loanData_NONA)
Coefficients:
(Intercept) BAD LOAN MORTDUE VALUE
2.857e+01 5.763e+00 1.018e-04 4.300e-05 -1.514e-05
REASONHomeImp JOBOffice JOBOther JOBProfExe JOBSales
6.990e-01 5.920e-01 3.753e-01 -2.007e+00 1.858e+00
JOBSelf DEROG CLAGE NINQ CLNO
4.641e-01 -6.064e-01 -4.392e-03 5.461e-01 8.167e-02
We get a lot of output - but the final model is:
Anova Table (Type II tests)
Response: DEBTINC
Sum Sq Df F value Pr(>F)
BAD 8288 1 152.5313 < 2.2e-16 ***
LOAN 3184 1 58.5922 2.521e-14 ***
MORTDUE 2795 1 51.4306 9.087e-13 ***
VALUE 478 1 8.8010 0.003032 **
REASON 311 1 5.7301 0.016731 *
JOB 3619 5 13.3201 7.030e-13 ***
DEROG 374 1 6.8780 0.008766 **
CLAGE 397 1 7.3035 0.006917 **
NINQ 2193 1 40.3623 2.394e-10 ***
CLNO 1625 1 29.8969 4.890e-08 ***
Residuals 181981 3349
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we use the dredge
function, within the MuMIn
package
library(MuMIn)
fullModel <- lm(DEBTINC ~ ., data=loanData_NONA, na.action = "na.fail")
dredgedModel <- dredge(fullModel)
Giving a lot of output (partially displayed):
head(dredgedModel, n = 10)
Global model call: lm(formula = DEBTINC ~ ., data = loanData_NONA, na.action = "na.fail")
---
Model selection table
(Intrc) BAD CLAGE CLNO DELIN DEROG JOB LOAN
2040 28.57 5.763 -0.004392 0.08167 -0.6064 + 1.018e-04
4088 28.68 5.751 -0.003947 0.08227 -0.6184 + 1.034e-04
2048 28.58 5.888 -0.004336 0.08356 -0.1646 -0.5870 + 1.014e-04
4096 28.69 5.870 -0.003905 0.08406 -0.1578 -0.5995 + 1.030e-04
1528 28.86 5.769 -0.004206 0.07972 -0.5920 + 9.467e-05
4086 28.25 5.848 0.07587 -0.6107 + 1.044e-04
3576 28.96 5.759 -0.003847 0.08011 -0.6009 + 9.567e-05
1536 28.87 5.887 -0.004151 0.08149 -0.1558 -0.5735 + 9.427e-05
2024 28.56 5.487 -0.004288 0.07857 + 1.021e-04
4094 28.27 5.975 0.07786 -0.1683 -0.5906 + 1.039e-04
MORTD NINQ REASO VALUE YOJ df logLik AICc delta
2040 4.300e-05 0.5461 + -1.514e-05 16 -11485.79 23003.7 0.00
4088 4.193e-05 0.5453 + -1.461e-05 -0.02114 17 -11485.08 23004.3 0.60
2048 4.307e-05 0.5418 + -1.539e-05 17 -11485.30 23004.8 1.04
4096 4.203e-05 0.5412 + -1.486e-05 -0.02055 18 -11484.63 23005.5 1.73
1528 4.234e-05 0.5226 -1.413e-05 15 -11488.67 23007.5 3.73
4086 4.302e-05 0.5579 + -1.620e-05 -0.03061 16 -11487.89 23007.9 4.21
3576 4.147e-05 0.5210 -1.368e-05 -0.01667 16 -11488.22 23008.6 4.86
1536 4.240e-05 0.5183 -1.435e-05 16 -11488.23 23008.6 4.88
2024 4.371e-05 0.5120 + -1.540e-05 15 -11489.24 23008.6 4.88
4094 4.311e-05 0.5534 + -1.645e-05 -0.02988 17 -11487.38 23008.9 5.21
weight
2040 0.297
4088 0.220
2048 0.176
4096 0.125
1528 0.046
4086 0.036
3576 0.026
1536 0.026
2024 0.026
4094 0.022
Models ranked by AICc(x)
Restricting to the top models
subset(dredgedModel, delta < 4)
Global model call: lm(formula = DEBTINC ~ ., data = loanData_NONA, na.action = "na.fail")
---
Model selection table
(Intrc) BAD CLAGE CLNO DELIN DEROG JOB LOAN
2040 28.57 5.763 -0.004392 0.08167 -0.6064 + 1.018e-04
4088 28.68 5.751 -0.003947 0.08227 -0.6184 + 1.034e-04
2048 28.58 5.888 -0.004336 0.08356 -0.1646 -0.5870 + 1.014e-04
4096 28.69 5.870 -0.003905 0.08406 -0.1578 -0.5995 + 1.030e-04
1528 28.86 5.769 -0.004206 0.07972 -0.5920 + 9.467e-05
MORTD NINQ REASO VALUE YOJ df logLik AICc delta
2040 4.300e-05 0.5461 + -1.514e-05 16 -11485.79 23003.7 0.00
4088 4.193e-05 0.5453 + -1.461e-05 -0.02114 17 -11485.08 23004.3 0.60
2048 4.307e-05 0.5418 + -1.539e-05 17 -11485.30 23004.8 1.04
4096 4.203e-05 0.5412 + -1.486e-05 -0.02055 18 -11484.63 23005.5 1.73
1528 4.234e-05 0.5226 -1.413e-05 15 -11488.67 23007.5 3.73
weight
2040 0.344
4088 0.254
2048 0.204
4096 0.145
1528 0.053
Models ranked by AICc(x)
So
For predictive models we might consider model-averaging (more on this in data-mining/machine-learning)
We've covered:
Upcoming in linear models: