C. Donovan
We've seen how the linear model encompasses a number of our previous methods
These model the signal with a design matrix \( \mathbf{X} \) and associated parameters \( \boldsymbol{\beta} \), which are estimated assuming the noise is iid (independent and identically distributed) Normal
We look at extending this model for signal to interactions and more
My favourite equation
Many models are effectively regressions - they are broadly of the form:
\[ y = f(x_1, \ldots, x_p) + \text{noise} \]
So we need to estimate the signal \( f \), which is usually contingent on an idea of the distribution of the error (because we have to tease the signal and noise apart)
As we'll see, our estimates for the signal are optimised in light of our model for noise
The term linear can be used differently in different contexts
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} \]
\[ y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} \]
We've included categorical covariates in our regression equation by cunning use of the design matrix (via dummy variables).
Returning to the mortgage default data:
factModel <- lm(DEBTINC ~ JOB + VALUE, data = loanData)
summary(factModel)
Call:
lm(formula = DEBTINC ~ JOB + VALUE, data = loanData)
Residuals:
Min 1Q Median 3Q Max
-34.734 -4.695 0.953 4.785 165.285
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.282e+01 4.252e-01 77.184 < 2e-16 ***
JOBOffice -6.126e-01 4.518e-01 -1.356 0.1752
JOBOther -5.127e-01 3.956e-01 -1.296 0.1951
JOBProfExe -3.273e+00 4.309e-01 -7.596 3.7e-14 ***
JOBSales 2.205e+00 9.872e-01 2.233 0.0256 *
JOBSelf -9.053e-01 7.820e-01 -1.158 0.2471
VALUE 2.267e-05 2.301e-06 9.854 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.234 on 4429 degrees of freedom
(1524 observations deleted due to missingness)
Multiple R-squared: 0.03583, Adjusted R-squared: 0.03452
F-statistic: 27.43 on 6 and 4429 DF, p-value: < 2.2e-16
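To see the dummy variables being used behind the scenes, we can inspect the design matrix directly - a minimal sketch, assuming factModel as fitted above:
# one dummy column per non-baseline JOB level, plus the intercept and VALUE
head(model.matrix(factModel))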
To test between factor levels that are not the baseline requires re-levelling the factor (or specifying contrasts) and refitting:
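A sketch of the re-levelling route, assuming JOB is stored as a factor (the choice of "Office" as the new baseline is arbitrary, for illustration; the copy avoids altering loanData for later fits):
# make "Office" the baseline, refit, and read the t-tests against it
loanDataRelevel <- loanData
loanDataRelevel$JOB <- relevel(loanDataRelevel$JOB, ref = "Office")
summary(lm(DEBTINC ~ JOB + VALUE, data = loanDataRelevel))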
To test the significance of the factor variable overall:
anova(factModel)
Analysis of Variance Table
Response: DEBTINC
Df Sum Sq Mean Sq F value Pr(>F)
JOB 5 4576 915.1 13.499 4.295e-13 ***
VALUE 1 6582 6582.2 97.092 < 2.2e-16 ***
Residuals 4429 300258 67.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So this is equivalent to a one-way ANOVA when the factor is the only covariate in the model, and it works more generally when other covariates are present.
Sequential vs non-sequential: observe that the \( p \)-values change depending on the order in which terms enter the model. These changes can be large.
factModel <- lm(DEBTINC ~ VALUE + JOB, data = loanData)
anova(factModel)
Analysis of Variance Table
Response: DEBTINC
Df Sum Sq Mean Sq F value Pr(>F)
VALUE 1 4385 4384.8 64.679 1.122e-15 ***
JOB 5 6773 1354.6 19.981 < 2.2e-16 ***
Residuals 4429 300258 67.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You're often after non-sequential.
A different version of this is non-sequential (note the “Type II tests” in the output below)
library(car)
Anova(factModel)
Anova Table (Type II tests)
Response: DEBTINC
Sum Sq Df F value Pr(>F)
VALUE 6582 1 97.092 < 2.2e-16 ***
JOB 6773 5 19.981 < 2.2e-16 ***
Residuals 300258 4429
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You're often after non-sequential - see later for a common exception where order of testing matters.
Combining numeric covariates conceptually produces a model of signal that is a plane or hyperplane.
# a grid of covariates
x_1 <- 1:40
x_2 <- 1:40
gridX <- expand.grid(x_1 = x_1, x_2 = x_2)
# y is a function of these - note this looks like signal for an lm
y <- 2*gridX$x_1 -2*gridX$x_2 + 100
# giving this
persp(matrix(y, ncol = 40), theta = 150, phi = 30, shade = 0.2, col = 'purple',
xlab = "x1", ylab = "x2", zlab = "y")
Adding categorical covariates to this mix gives different intercepts by factor level - so conceptually multiple lines, planes or hyperplanes
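A quick sketch of that idea, reusing the grid from before - a two-level factor just shifts the whole plane up or down (the offset of 25 is arbitrary, for illustration):
# same slopes, different intercepts for the two factor levels
yLevelA <- 2*gridX$x_1 - 2*gridX$x_2 + 100        # level "a"
yLevelB <- 2*gridX$x_1 - 2*gridX$x_2 + 100 + 25   # level "b": plane shifted up by 25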
The interpretation of these terms as presented is straightforward (but not for interactions, covered later).
Covariates don't necessarily affect \( y \) in isolation
The form we've seen so far has our covariates entering into the equation in a simple additive fashion i.e.
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]
This means the effect of \( x_1 \) on \( y \) can be isolated/separated:
The world is rarely this simple.
“for every unit increase of \( x_s \), \( y \) increases by \( A \) units given \( x_p \) is \( B \)”
Thus, the marginal \( x \) terms alone cannot describe the relationship with \( y \)
Lots of surfaces can't be described by only their margins:
Signal they permit
Consider factor \( x_1 \) with two levels \( a \) and \( b \), and a numeric \( x_2 \)
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) \]
In short, our model has two lines with different intercepts and slopes
Let's look at the design matrix
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & 0 & x_{2,1} & 0\\ 1 & 0 & x_{2,2} & 0\\ 1 & 0 & x_{2,3} & 0\\ 1 & 1 & x_{2,4} & x_{2,4}\\ 1 & 1 & x_{2,5} & x_{2,5}\\ 1 & 1 & x_{2,6} & x_{2,6}\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \hat{\beta_3}\\ \end{bmatrix} = \begin{bmatrix} \hat{\beta_0} (+ 0\hat{\beta_1}) + \hat{\beta_2}x_{2,1} (+ 0\hat{\beta_3})\\ \hat{\beta_0} (+ 0\hat{\beta_1}) + \hat{\beta_2}x_{2,2} (+ 0\hat{\beta_3})\\ \hat{\beta_0} (+ 0\hat{\beta_1}) + \hat{\beta_2}x_{2,3} (+ 0\hat{\beta_3})\\ \hat{\beta_0} + \hat{\beta_1} + \hat{\beta_2}x_{2,4} + \hat{\beta_3} x_{2,4}\\ \hat{\beta_0} + \hat{\beta_1} + \hat{\beta_2}x_{2,5} + \hat{\beta_3} x_{2,5}\\ \hat{\beta_0} + \hat{\beta_1} + \hat{\beta_2}x_{2,6} + \hat{\beta_3} x_{2,6}\\ \end{bmatrix} \]
Let's look at the last line, to see the slope modifier
\[ \hat{y}_6 = \hat{\beta_0} + \hat{\beta_1} + \hat{\beta_2}x_{2,6} + \hat{\beta_3} x_{2,6} = (\hat{\beta_0} + \hat{\beta_1}) + (\hat{\beta_2} + \hat{\beta_3})x_{2,6} \]
This extends to more factor levels and more covariates generally. The magic of design matrices and matrix algebra.
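A small simulated sketch of the two-lines idea (all names and coefficient values here are made up for illustration):
# factor x1 (levels a/b) and numeric x2, with an interaction
set.seed(1)
x1 <- factor(rep(c("a", "b"), each = 50))
x2 <- runif(100, 0, 10)
y  <- 1 + 2*(x1 == "b") + 0.5*x2 + 1.5*(x1 == "b")*x2 + rnorm(100)
twoLines <- lm(y ~ x1 * x2)
coef(twoLines)  # intercept & slope for "a", plus the "b" intercept/slope modifiers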
With a simple interaction between house value and reason for loan:
interactionModel <- lm(DEBTINC ~ REASON*VALUE, data = loanData)
summary(interactionModel)
Call:
lm(formula = DEBTINC ~ REASON * VALUE, data = loanData)
Residuals:
Min 1Q Median 3Q Max
-33.814 -4.654 0.890 4.987 168.448
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.233e+01 3.282e-01 98.482 < 2e-16 ***
REASONHomeImp -1.005e+00 5.194e-01 -1.935 0.0531 .
VALUE 1.914e-05 2.855e-06 6.704 2.28e-11 ***
REASONHomeImp:VALUE 2.994e-08 4.293e-06 0.007 0.9944
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.146 on 4472 degrees of freedom
(1484 observations deleted due to missingness)
Multiple R-squared: 0.02057, Adjusted R-squared: 0.01992
F-statistic: 31.31 on 3 and 4472 DF, p-value: < 2.2e-16
Anova(interactionModel)
Anova Table (Type II tests)
Response: DEBTINC
Sum Sq Df F value Pr(>F)
REASON 951 1 14.328 0.0001556 ***
VALUE 5354 1 80.699 < 2.2e-16 ***
REASON:VALUE 0 1 0.000 0.9944352
Residuals 296720 4472
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
See here that the interaction is not significant - our model could be simpler. This term isn't distinguishable from noise.
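One way to act on that - refit without the interaction and compare the nested models (a sketch; simplerModel is just an illustrative name):
# drop the interaction term and compare with an F test
simplerModel <- update(interactionModel, . ~ . - REASON:VALUE)
anova(simplerModel, interactionModel)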
Model terms involving one covariate are termed main effects (or marginal terms). They may also appear in interactions with other covariates.
As a principle: if an interaction is retained in the model, the main effects it involves should also be retained (the principle of marginality).
Signal they permit
Consider numeric \( x_1 \) and \( x_2 \); the equation is the same as before
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) \]
In short, we fit a (hyper) plane, plus an interaction surface
Let's look at the design matrix
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_{1,1} & x_{2,1} & x_{1,1} \times x_{2,1}\\ 1 & x_{1,2} & x_{2,2} & x_{1,2} \times x_{2,2}\\ 1 & x_{1,3} & x_{2,3} & x_{1,3} \times x_{2,3}\\ 1 & x_{1,4} & x_{2,4} & x_{1,4} \times x_{2,4}\\ 1 & x_{1,5} & x_{2,5} & x_{1,5} \times x_{2,5}\\ 1 & x_{1,6} & x_{2,6} & x_{1,6} \times x_{2,6}\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \hat{\beta_3}\\ \end{bmatrix} = \begin{bmatrix} \hat{\beta_0} + \hat{\beta_1} x_{1,1} + \hat{\beta_2} x_{2,1} + \hat{\beta_3} (x_{1,1} \times x_{2,1})\\ \hat{\beta_0} + \hat{\beta_1} x_{1,2} + \hat{\beta_2} x_{2,2} + \hat{\beta_3} (x_{1,2} \times x_{2,2})\\ \hat{\beta_0} + \hat{\beta_1} x_{1,3} + \hat{\beta_2} x_{2,3} + \hat{\beta_3} (x_{1,3} \times x_{2,3})\\ \hat{\beta_0} + \hat{\beta_1} x_{1,4} + \hat{\beta_2} x_{2,4} + \hat{\beta_3} (x_{1,4} \times x_{2,4})\\ \hat{\beta_0} + \hat{\beta_1} x_{1,5} + \hat{\beta_2} x_{2,5} + \hat{\beta_3} (x_{1,5} \times x_{2,5})\\ \hat{\beta_0} + \hat{\beta_1} x_{1,6} + \hat{\beta_2} x_{2,6} + \hat{\beta_3} (x_{1,6} \times x_{2,6})\\ \end{bmatrix} \]
Let's look at how this modifies the slope WRT \( x_1 \):
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x_{1} + \hat{\beta_2} x_{2} + \hat{\beta_3} (x_{1} \times x_{2}) \]
with some rearrangement
\[ \hat{y} = (\hat{\beta_0} + \hat{\beta_2} x_{2}) + (\hat{\beta_1} + \hat{\beta_3} x_{2})x_{1} \]
So, the slope for \( x_1 \) changes by \( \beta_3 \) for a unit increase in \( x_2 \)
# interaction surface over the earlier grid (note the columns are x_1 and x_2)
yInt <- gridX$x_1 * gridX$x_2
image(matrix(yInt, ncol = 40), col = terrain.colors(100))
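A simulated sketch of the slope interpretation above (the coefficients 2, 3 and 4 are made up for illustration):
# the implied slope for x1 at a given x2 is beta1 + beta3 * x2
set.seed(2)
x1 <- runif(200); x2 <- runif(200)
y  <- 1 + 2*x1 + 3*x2 + 4*x1*x2 + rnorm(200, sd = 0.1)
numInt <- lm(y ~ x1 * x2)
coef(numInt)["x1"] + coef(numInt)["x1:x2"] * 0.5  # slope for x1 at x2 = 0.5, roughly 2 + 4*0.5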
Signal they permit
Consider factor \( x_1 \) with two levels \( A \) and \( B \), and \( x_2 \) with two levels \( C \) and \( D \)
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 = B, x_2 = D) \]
Through the magic of dummy variables:
In short, these estimate different means for all combinations of factor levels:
If \( x_1 \) and \( x_2 \) were simply additive, then \( \beta_3 \) is not needed.
Let's look at the design matrix
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \hat{\beta_3}\\ \end{bmatrix} = \begin{bmatrix} \hat{\beta_0} (+ 0\hat{\beta_1} + 0\hat{\beta_2} + 0\hat{\beta_3})\\ \hat{\beta_0} (+ 0\hat{\beta_1} + 0\hat{\beta_2} + 0\hat{\beta_3})\\ \hat{\beta_0} + \hat{\beta_1} + (0\hat{\beta_2} + 0\hat{\beta_3})\\ \hat{\beta_0} + \hat{\beta_1} + (0\hat{\beta_2} + 0\hat{\beta_3})\\ \hat{\beta_0} (+ 0\hat{\beta_1}) + \hat{\beta_2} + (0\hat{\beta_3})\\ \hat{\beta_0} (+ 0\hat{\beta_1}) + \hat{\beta_2} + (0\hat{\beta_3})\\ \hat{\beta_0} + \hat{\beta_1} + \hat{\beta_2} + \hat{\beta_3}\\ \hat{\beta_0} + \hat{\beta_1} + \hat{\beta_2} + \hat{\beta_3}\\ \end{bmatrix} \]
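A tiny simulated sketch of the cell-means idea (two observations per cell; all names are illustrative):
# with the interaction, fitted values reproduce the four cell means
set.seed(3)
x1 <- factor(rep(c("A", "B"), each = 4))
x2 <- factor(rep(c("C", "D"), times = 4))
y  <- rnorm(8, mean = 10)
cellModel <- lm(y ~ x1 * x2)
cbind(fitted = fitted(cellModel), cellMean = ave(y, x1, x2))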
We've covered:
Upcoming in linear models: