MT5762 Lecture 17

C. Donovan

More on linear models

We've seen a number of ways of increasing the complexity of the model of signal

How do we decide which model is appropriate, without a strong a priori view?

The linear model form - recap

My favourite equation

Many models are effectively regressions - they are broadly of the form:

\[ y = f(x_1,..., x_p) + noise \]

So we need to estimate the signal \( f \), which is usually contingent on an idea of the distribution of the error (because we have to tease the signal and noise apart)

As we'll see, our estimates for the signal are optimised in light of our model for noise

The linear model form - recap

The term linear can be used differently in different contexts

  • It might refer to some shape of relationship e.g. a straight line relating some variable \( x \) to \( y \)
  • Here we mean the linear form. In short we model the signal as \( y = f(x1,...x_p) \) where it is

\[ y = \mathbf{X}\boldsymbol{\beta} \]

  • Recall this matrix equation means things like:

\[ y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]

Choosing models

So we can now add terms for: numeric/factor covariates, their interactions, polynomials, etc

The range of potential models, even for a few covariates, is massive

If we don't know the model exactly, how do we choose between candidates?

Choosing models - significance

An obvious way is to drop terms that are statistically insignificant

  • If the test of the parameter shows 0 is a plausible value, it is potentially just noise
  • We may have to test sequentially: higher order interactions first (i.e. don't interpret marginal effects if they are part of an interaction)

Choosing models - significance

  interactionModel <- lm(DEBTINC ~ REASON*VALUE, data = loanData)
  summary(interactionModel)

Call:
lm(formula = DEBTINC ~ REASON * VALUE, data = loanData)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.814  -4.654   0.890   4.987 168.448 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.233e+01  3.282e-01  98.482  < 2e-16 ***
REASONHomeImp       -1.005e+00  5.194e-01  -1.935   0.0531 .  
VALUE                1.914e-05  2.855e-06   6.704 2.28e-11 ***
REASONHomeImp:VALUE  2.994e-08  4.293e-06   0.007   0.9944    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.146 on 4472 degrees of freedom
  (1484 observations deleted due to missingness)
Multiple R-squared:  0.02057,   Adjusted R-squared:  0.01992 
F-statistic: 31.31 on 3 and 4472 DF,  p-value: < 2.2e-16

Choosing models - significance

  library(car)
  Anova(interactionModel)
Anova Table (Type II tests)

Response: DEBTINC
             Sum Sq   Df F value    Pr(>F)    
REASON          951    1  14.328 0.0001556 ***
VALUE          5354    1  80.699 < 2.2e-16 ***
REASON:VALUE      0    1   0.000 0.9944352    
Residuals    296720 4472                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So I'd drop the interaction term - it's arguably conflating noise with signal

Choosing models - significance

  interactionModel <- lm(DEBTINC ~ REASON + VALUE, data = loanData)
  Anova(interactionModel)
Anova Table (Type II tests)

Response: DEBTINC
          Sum Sq   Df F value    Pr(>F)    
REASON       951    1  14.331 0.0001553 ***
VALUE       5354    1  80.717 < 2.2e-16 ***
Residuals 296720 4473                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Choosing models - significance

This is commonly done, but often not favoured:

  • There is a multiple testing issue - for a large model, lots of tests
  • Aribitrary significance thresholds?
  • It's a greedy approach - the next best model depends on the previous step (not all models are considered) but this is common
  • With lots of data, lots of things are statistically significant (high precision)

We might consider fit measures to choose

\(R^2\) - the ubiquitous R-square

One form is:

\[ R^2=1-\frac{(n-1)^{-1}\sum_{i=1}^n(y_i-\hat{y}_i)^2}{(n-1)^{-1}\sum_{i=1}^n(y_i-\bar{y})^2} = 1-\frac{{\mathrm SSE}/(n-1)}{{\mathrm SST}/(n-1)} \]

  • SSE is the sum of squared errors for your model, the SST is the total sum of squares for your data.
  • Hence the interpretation as “the proportion of variance explained by the model”
  • Also calculated as the squared correlation between \( y \) and \( \hat{y} \)

The vanilla-flavoured \(R^2\)

  • Bounded by 0 and 1,
  • \( R^2=1 \) is a perfect agreement between the model predictions and the observed data
  • The value 1 would indicate that all of the variability in the response is accounted for by the model
  • A value of zero indicates the worst possible fit, and there is no relationship between the observed data and the model predictions.

Why the \(R^2\) can suck

  • Only reflects our fit to our sample, not unseen data
  • Offers no protection therefore against models that are too complex (ones that don't generalise)
  • Don't use it to choose between models of different complexity

Can we fix it a bit?…

A penalized \(R^2\)

adjusted-\( R^2 \):

\[ 1-\frac{(n-p-1)^{-1}\sum_{i=1}^n(y_i-\hat{y}_i)^2}{(n-1)^{-1}\sum_{i=1}^n(y_i-\bar{y})^2} =1-\frac{SSE/(n-p-1)}{SST/(n-1)} \]

  • SSE and SST as before
  • so, the proportion of variance explained by the model, but has \( df \) included
  • it does not necessarily increase with the complexity of the model
  • it has a penalty term that introduces two aspects: fidelity to the data and parsimony.
  • (sort of) Bounded by 0 and 1, with large values being a relatively good fit

Akaike's Information Criterion (AIC)

A penalized Likelihood

  • The most widespread measure for model selection within statistics
  • Theory is extensive, but its form is simple:

\[ AIC=-2\ell + 2p \]

  • where the \( \ell \) is the log-likelihood, and \( p \) is the number of parameters in the model.

Akaike's Information Criterion (AIC)

Penalized Likelihood

\[ AIC=-2\ell + 2p \]

  • We seek the model that minimises this value (i.e. head towards -ve \( \infty \))
  • It can be seen that complex models are penalised by the inclusion of \( p \).
  • Similar to the adjusted-\( R^2 \) there is a measure of fidelity of the model to the data and a penalty term
  • The value of the AIC has no particular interpretation - use relatively.

Model selection

We're going to select simpler models. Often simply referred to as parsimony - which is deemed a good thing

Occam's razor: after William of Ockham Entities must not be multiplied beyond necessity (Non sunt multiplicanda entia sine necessitate) - although that's John Punch's (1639).

All possible subsets

Use a brute-force algorithmic approach to search through all possible models - obviously a difficult combinatoric problem. Consider:

  • 40 potential covariates which we will only consider as entering marginally i.e. as main effects gives about \( 10^{12} \) potential models.
  • Considering even only first order interactions for this problem makes much more massive (in the order of \( 10^{200} \) models).

All possible subsets

  • Allowing a 100th of a second to fit each model our sun will have expired well before finishing.
  • There are algorithms (such as leaps and bounds) which have made this problem far more tractable, but 100+ variables is considered a big problem
  • The common algorithms are restricted to basic ordinary-least-squares problems.

Step selection methods

  • Very simple conceptually and are far smaller problems than searching all subsets.
  • They may not (probably won't) return the optimal' model - they are greedy algorithms
  • We must set our objective measure - such as the adjusted-\( R^2 \) or AIC

Forwards selection

  • Begin with a very simple model e.g. \( \mathbf{y}=\hat{\beta}_0 \)
  • Add the 'best' covariate to the current model e.g. the \( x \) that explains the greatest variance on its inclusion, or decreases the AIC most etc.
  • Repeat step 2 until a stopping rule is achieved e.g. the adjusted-\( R^2 \) is no longer increases, AIC is minimised etc.

Backwards selection

  • Begin with a very complex model e.g. contains all covariates \( \mathbf{y}=\mathbf{X}\hat{\boldsymbol{\beta}} \)
  • Remove the least informative covariate from the current model e.g. the covariate with the largest \( p \)-value for the \( t \)-statistic from its parameter estimate, or decreases the AIC most, etc.
  • Repeat step 2 until a stopping rule is achieved e.g. the adjusted-\( R^2 \) is no longer increases, AIC is minimised etc.

Forwards-Backwards selection

  • Begin with a very complex model,
  • Remove the least informative covariate from the current model
  • (after first iteration) Determine if one of the excluded variables could be profitably exchanged with a variable in the model (such as a net decrease in the AIC) - swap if this is true.
  • Repeat from step 2 until a stopping rule is achieved e.g. the adjusted-\( R^2 \) is no longer increases, AIC is minimised etc.

The same rationale can be applied but from a very basic starting model, increasing in complexity.

Step selection - Fit a complex model

  fullModel <- lm(DEBTINC ~ ., data=loanData)
  summary(fullModel)

Call:
lm(formula = DEBTINC ~ ., data = loanData)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.315  -4.251   0.659   4.574 102.252 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2.869e+01  6.097e-01  47.057  < 2e-16 ***
BAD            5.870e+00  4.837e-01  12.138  < 2e-16 ***
LOAN           1.030e-04  1.338e-05   7.703 1.73e-14 ***
MORTDUE        4.203e-05  6.064e-06   6.930 5.03e-12 ***
VALUE         -1.486e-05  5.130e-06  -2.898  0.00378 ** 
REASONHomeImp  7.390e-01  2.935e-01   2.518  0.01186 *  
JOBOffice      5.675e-01  4.707e-01   1.206  0.22800    
JOBOther       3.674e-01  4.106e-01   0.895  0.37097    
JOBProfExe    -2.044e+00  4.391e-01  -4.655 3.36e-06 ***
JOBSales       1.715e+00  1.087e+00   1.578  0.11468    
JOBSelf        3.307e-01  8.585e-01   0.385  0.70012    
YOJ           -2.055e-02  1.781e-02  -1.154  0.24869    
DEROG         -5.995e-01  2.323e-01  -2.581  0.00991 ** 
DELINQ        -1.578e-01  1.669e-01  -0.946  0.34440    
CLAGE         -3.905e-03  1.669e-03  -2.340  0.01933 *  
NINQ           5.412e-01  8.607e-02   6.287 3.65e-10 ***
CLNO           8.406e-02  1.506e-02   5.580 2.59e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.371 on 3347 degrees of freedom
  (2596 observations deleted due to missingness)
Multiple R-squared:  0.1447,    Adjusted R-squared:  0.1406 
F-statistic: 35.39 on 16 and 3347 DF,  p-value: < 2.2e-16

Step selection

Giving (note no interactions used)

  Anova(fullModel)
Anova Table (Type II tests)

Response: DEBTINC
          Sum Sq   Df  F value    Pr(>F)    
BAD         8005    1 147.3204 < 2.2e-16 ***
LOAN        3224    1  59.3422 1.734e-14 ***
MORTDUE     2609    1  48.0252 5.026e-12 ***
VALUE        456    1   8.3961  0.003785 ** 
REASON       344    1   6.3388  0.011859 *  
JOB         3633    5  13.3719 6.227e-13 ***
YOJ           72    1   1.3311  0.248688    
DEROG        362    1   6.6593  0.009906 ** 
DELINQ        49    1   0.8942  0.344401    
CLAGE        298    1   5.4770  0.019326 *  
NINQ        2148    1  39.5296 3.649e-10 ***
CLNO        1692    1  31.1378 2.595e-08 ***
Residuals 181856 3347                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step selection

Delete an insignificant term

  newModel <- update(fullModel, . ~ . -DELINQ)

  Anova(newModel)
Anova Table (Type II tests)

Response: DEBTINC
          Sum Sq   Df  F value    Pr(>F)    
BAD         9819    1 158.9785 < 2.2e-16 ***
LOAN        3315    1  53.6698 2.955e-13 ***
MORTDUE     2726    1  44.1314 3.568e-11 ***
VALUE        421    1   6.8232  0.009038 ** 
REASON       318    1   5.1500  0.023310 *  
JOB         4715    5  15.2702 7.211e-15 ***
YOJ           87    1   1.4024  0.236405    
DEROG        572    1   9.2600  0.002360 ** 
CLAGE        370    1   5.9915  0.014426 *  
NINQ        2042    1  33.0704 9.683e-09 ***
CLNO        1534    1  24.8370 6.554e-07 ***
Residuals 207514 3360                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step selection - automated (by AIC)

  loanData_NONA <- na.omit(loanData)

  fullModel <- lm(DEBTINC ~ ., data=loanData_NONA)

  step(fullModel)
Start:  AIC=13456.65
DEBTINC ~ BAD + LOAN + MORTDUE + VALUE + REASON + JOB + YOJ + 
    DEROG + DELINQ + CLAGE + NINQ + CLNO

          Df Sum of Sq    RSS   AIC
- DELINQ   1      48.6 181905 13456
- YOJ      1      72.3 181928 13456
<none>                 181856 13457
- CLAGE    1     297.6 182154 13460
- REASON   1     344.4 182200 13461
- DEROG    1     361.8 182218 13461
- VALUE    1     456.2 182312 13463
- CLNO     1    1691.8 183548 13486
- NINQ     1    2147.8 184004 13494
- MORTDUE  1    2609.4 184465 13503
- JOB      5    3632.7 185489 13513
- LOAN     1    3224.3 185080 13514
- BAD      1    8004.5 189861 13600

Step:  AIC=13455.54
DEBTINC ~ BAD + LOAN + MORTDUE + VALUE + REASON + JOB + YOJ + 
    DEROG + CLAGE + NINQ + CLNO

          Df Sum of Sq    RSS   AIC
- YOJ      1      76.6 181981 13455
<none>                 181905 13456
- CLAGE    1     304.2 182209 13459
- REASON   1     339.9 182244 13460
- DEROG    1     388.0 182293 13461
- VALUE    1     442.1 182347 13462
- CLNO     1    1646.5 183551 13484
- NINQ     1    2186.0 184091 13494
- MORTDUE  1    2598.1 184503 13501
- JOB      5    3644.3 185549 13512
- LOAN     1    3252.0 185157 13513
- BAD      1    8247.7 190152 13603

Step:  AIC=13454.96
DEBTINC ~ BAD + LOAN + MORTDUE + VALUE + REASON + JOB + DEROG + 
    CLAGE + NINQ + CLNO

          Df Sum of Sq    RSS   AIC
<none>                 181981 13455
- REASON   1     311.4 182293 13459
- DEROG    1     373.7 182355 13460
- CLAGE    1     396.9 182378 13460
- VALUE    1     478.2 182459 13462
- CLNO     1    1624.6 183606 13483
- NINQ     1    2193.2 184175 13493
- MORTDUE  1    2794.7 184776 13504
- JOB      5    3619.0 185600 13511
- LOAN     1    3183.8 185165 13511
- BAD      1    8288.4 190270 13603

Call:
lm(formula = DEBTINC ~ BAD + LOAN + MORTDUE + VALUE + REASON + 
    JOB + DEROG + CLAGE + NINQ + CLNO, data = loanData_NONA)

Coefficients:
  (Intercept)            BAD           LOAN        MORTDUE          VALUE  
    2.857e+01      5.763e+00      1.018e-04      4.300e-05     -1.514e-05  
REASONHomeImp      JOBOffice       JOBOther     JOBProfExe       JOBSales  
    6.990e-01      5.920e-01      3.753e-01     -2.007e+00      1.858e+00  
      JOBSelf          DEROG          CLAGE           NINQ           CLNO  
    4.641e-01     -6.064e-01     -4.392e-03      5.461e-01      8.167e-02  

We get a lot of output - but the final model is:

Anova Table (Type II tests)

Response: DEBTINC
          Sum Sq   Df  F value    Pr(>F)    
BAD         8288    1 152.5313 < 2.2e-16 ***
LOAN        3184    1  58.5922 2.521e-14 ***
MORTDUE     2795    1  51.4306 9.087e-13 ***
VALUE        478    1   8.8010  0.003032 ** 
REASON       311    1   5.7301  0.016731 *  
JOB         3619    5  13.3201 7.030e-13 ***
DEROG        374    1   6.8780  0.008766 ** 
CLAGE        397    1   7.3035  0.006917 ** 
NINQ        2193    1  40.3623 2.394e-10 ***
CLNO        1625    1  29.8969 4.890e-08 ***
Residuals 181981 3349                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All possible subsets selection

Here we use the dredge function, within the MuMIn package

  library(MuMIn)

  fullModel <- lm(DEBTINC ~ ., data=loanData_NONA, na.action = "na.fail")  

  dredgedModel <- dredge(fullModel)

Giving a lot of output (partially displayed):

  head(dredgedModel, n = 10)
Global model call: lm(formula = DEBTINC ~ ., data = loanData_NONA, na.action = "na.fail")
---
Model selection table 
     (Intrc)   BAD     CLAGE    CLNO   DELIN   DEROG JOB      LOAN
2040   28.57 5.763 -0.004392 0.08167         -0.6064   + 1.018e-04
4088   28.68 5.751 -0.003947 0.08227         -0.6184   + 1.034e-04
2048   28.58 5.888 -0.004336 0.08356 -0.1646 -0.5870   + 1.014e-04
4096   28.69 5.870 -0.003905 0.08406 -0.1578 -0.5995   + 1.030e-04
1528   28.86 5.769 -0.004206 0.07972         -0.5920   + 9.467e-05
4086   28.25 5.848           0.07587         -0.6107   + 1.044e-04
3576   28.96 5.759 -0.003847 0.08011         -0.6009   + 9.567e-05
1536   28.87 5.887 -0.004151 0.08149 -0.1558 -0.5735   + 9.427e-05
2024   28.56 5.487 -0.004288 0.07857                   + 1.021e-04
4094   28.27 5.975           0.07786 -0.1683 -0.5906   + 1.039e-04
         MORTD   NINQ REASO      VALUE      YOJ df    logLik    AICc delta
2040 4.300e-05 0.5461     + -1.514e-05          16 -11485.79 23003.7  0.00
4088 4.193e-05 0.5453     + -1.461e-05 -0.02114 17 -11485.08 23004.3  0.60
2048 4.307e-05 0.5418     + -1.539e-05          17 -11485.30 23004.8  1.04
4096 4.203e-05 0.5412     + -1.486e-05 -0.02055 18 -11484.63 23005.5  1.73
1528 4.234e-05 0.5226       -1.413e-05          15 -11488.67 23007.5  3.73
4086 4.302e-05 0.5579     + -1.620e-05 -0.03061 16 -11487.89 23007.9  4.21
3576 4.147e-05 0.5210       -1.368e-05 -0.01667 16 -11488.22 23008.6  4.86
1536 4.240e-05 0.5183       -1.435e-05          16 -11488.23 23008.6  4.88
2024 4.371e-05 0.5120     + -1.540e-05          15 -11489.24 23008.6  4.88
4094 4.311e-05 0.5534     + -1.645e-05 -0.02988 17 -11487.38 23008.9  5.21
     weight
2040  0.297
4088  0.220
2048  0.176
4096  0.125
1528  0.046
4086  0.036
3576  0.026
1536  0.026
2024  0.026
4094  0.022
Models ranked by AICc(x) 

All possible subsets selection

Restricting to the top models

 subset(dredgedModel, delta < 4) 
Global model call: lm(formula = DEBTINC ~ ., data = loanData_NONA, na.action = "na.fail")
---
Model selection table 
     (Intrc)   BAD     CLAGE    CLNO   DELIN   DEROG JOB      LOAN
2040   28.57 5.763 -0.004392 0.08167         -0.6064   + 1.018e-04
4088   28.68 5.751 -0.003947 0.08227         -0.6184   + 1.034e-04
2048   28.58 5.888 -0.004336 0.08356 -0.1646 -0.5870   + 1.014e-04
4096   28.69 5.870 -0.003905 0.08406 -0.1578 -0.5995   + 1.030e-04
1528   28.86 5.769 -0.004206 0.07972         -0.5920   + 9.467e-05
         MORTD   NINQ REASO      VALUE      YOJ df    logLik    AICc delta
2040 4.300e-05 0.5461     + -1.514e-05          16 -11485.79 23003.7  0.00
4088 4.193e-05 0.5453     + -1.461e-05 -0.02114 17 -11485.08 23004.3  0.60
2048 4.307e-05 0.5418     + -1.539e-05          17 -11485.30 23004.8  1.04
4096 4.203e-05 0.5412     + -1.486e-05 -0.02055 18 -11484.63 23005.5  1.73
1528 4.234e-05 0.5226       -1.413e-05          15 -11488.67 23007.5  3.73
     weight
2040  0.344
4088  0.254
2048  0.204
4096  0.145
1528  0.053
Models ranked by AICc(x) 

All possible subsets selection

So

  • you can see there are models that are pretty similar in fit
  • but are different in stucture (so interpretation may not be trivially different)

For predictive models we might consider model-averaging (more on this in data-mining/machine-learning)

Why model selection might be good

  • We simplify our description of a system (why include things that might be noise?)
  • We might improve its generality for predictions
  • We end up with more data per parameter, giving more precision

Why model selection might be bad

  • The algorithmic inclusion/exclusion of variables can lead to models that aren't sensible
  • Simple perturbations of the data can lead to markedly different models i.e. the selection might be sensitive to noise
  • Basically we optimise for fit, not for interpretability
  • If automated, the intermediate models are not checked at all for validity

Recap and look-forwards

We've covered:

  • Model selection - need and controversy
  • Measures of model fit, particularly the AIC
  • Selection methods

Upcoming in linear models:

  • More diagnostics
  • Case study