MT5762 Lecture 14

C. Donovan

Generalising to linear models

Here we look to unify several of the models seen previously (which generated CIs and \( p \)-values)

The class of models is Linear Models

They have a specific structure for the signal and noise

The linear model form

My favourite equation

Many models are effectively regressions - they are broadly of the form:

\[ y = f(x_1, \dots, x_p) + \text{noise} \]

So we need to estimate the signal \( f \), which is usually contingent on an idea of the distribution of the error (because we have to tease the signal and noise apart)

As we'll see, our estimates for the signal are optimised in light of our model for noise

The linear model form

The term linear can be used differently in different contexts

  • It might refer to some shape of relationship e.g. a straight line relating some variable \( x \) to \( y \)
  • Here we mean the linear form. In short, we model the signal as \( y = f(x_1, \dots, x_p) \), where it takes the form

\[ y = \mathbf{X}\boldsymbol{\beta} \]

  • Recalling from last week, this matrix equation means things like:

\[ y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_p x_{p,i} \]

The linear model form

This is very wide. Naturally it includes a straight line relating \( x \) and \( y \):

\[ y = \beta_0 + \beta_1 x_1 \]

or indeed just modelling \( y \) as a constant (a mean would be best)

\[ y = \beta_0 \]
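A tiny R sketch of this form (hypothetical numbers throughout, just to show the mechanics): the signal is simply the design matrix multiplied by the parameter vector.

  x <- c(1, 2, 3, 4, 5)
  beta <- c(10, 2)           # hypothetical intercept and slope

  # design matrix for a straight line: a column of 1s and a column of x
  X <- cbind(1, x)

  X %*% beta                 # the signal: 10 + 2*x for each observation

  # the constant-only model keeps just the column of 1s
  cbind(rep(1, 5)) %*% 10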

Let's look at some examples

The linear model form

A basic regression type thing

[Figure: scatterplot with a fitted straight line]

Which fits the linear model for signal

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \end{bmatrix} \]
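In R, model.matrix shows the design matrix lm would build for this; a quick sketch with some made-up \( x \):

  x <- runif(10)

  # a column of 1s (the intercept) and a column of x, as in the matrix above
  head(model.matrix(~ x))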

The linear model form

A cubic regression thing:

[Figure: scatterplot with a fitted cubic curve]

Which fits the linear model for signal

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3\\ 1 & x_2 & x_2^2 & x_2^3\\ \vdots & \vdots & \vdots & \vdots\\ 1 & x_n & x_n^2 & x_n^3\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \hat{\beta_3}\\ \end{bmatrix} \]
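Note the cubic is still linear in the \( \beta \)s; in R the extra columns can be requested with I() (or poly(x, 3, raw = TRUE)) - a sketch:

  x <- runif(10)

  # columns 1, x, x^2, x^3 - linear in the betas, despite the curved fit
  head(model.matrix(~ x + I(x^2) + I(x^3)))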

The linear model form

A plane (could be hyper-plane with more \( x \))

[Figure: fitted plane over two covariates]

Which fits the linear model for signal

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_{1,1} & x_{2,1}\\ 1 & x_{1,2} & x_{2,2}\\ \vdots & \vdots & \vdots\\ 1 & x_{1,n} & x_{2,n}\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \end{bmatrix} \]

Dummy variables

We frequently need to put categorical variables into an equation - demanding some numeric representation

We do this using dummy variable encoding/representation (comp. sci. focussed people use the ridiculous term one-hot encoding)

We cover this here - linear models encompass this

Dummy variables

We just need to get clever with the design matrix i.e. the matrix on the RHS of the regression equation that expresses our covariates

  • We create a binary matrix
  • The 0/1 codings indicate which level of the factor variable (our categorical covariate) each observation belongs to
  • Each level therefore gets a parameter in the parameter vector

Dummy variables

So - say a factor \( x \) has 3 levels, with its first 6 values being \( \mathbf{x} = [A, A, B, B, C, C] \); we might encode:

\[ \mathbf{x} \rightarrow \begin{bmatrix} 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 0 & 0 & 1\\ \end{bmatrix} \]

A “1” in the first column means level A, a “1” in the second column means level B and a “1” in the third column means level C
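R will generate exactly this sort of binary matrix for us; a sketch, dropping the intercept so that every level gets its own column:

  x <- factor(c("A", "A", "B", "B", "C", "C"))

  # one 0/1 indicator column per level
  model.matrix(~ x - 1)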

Dummy variables

Through the magic of matrix multiplication:

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 0 & 0 & 1\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \end{bmatrix} = \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \hat{\beta_2}\\ \end{bmatrix} \]

So, with an appropriate fitting objective, the estimated \( \hat{\beta_j} \) could be means of each factor level (A, B, C).

Usually however, we make one level the baseline, so some parameters are estimated differences between the baseline and other factor levels i.e. the bit you add to go from the baseline to another level.

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & 0 & 0\\ 1 & 0 & 0\\ 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 0 & 1\\ 1 & 0 & 1\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \end{bmatrix} = \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_0}\\ \hat{\beta_0} + \hat{\beta_1}\\ \hat{\beta_0} + \hat{\beta_1}\\ \hat{\beta_0} + \hat{\beta_2}\\ \hat{\beta_0} + \hat{\beta_2}\\ \end{bmatrix} \]

  • \( \hat{\beta_0} \) might be the mean of the baseline factor (A)
  • \( \hat{\beta_1} \) might be the mean difference between baseline and level 2 (B - A)
  • \( \hat{\beta_2} \) might be the mean difference between baseline and level 3 (C - A)
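This baseline ("treatment") coding is what R uses by default for factors; continuing the sketch:

  x <- factor(c("A", "A", "B", "B", "C", "C"))

  # R's default: an intercept plus indicators for B and C; A is the baseline
  model.matrix(~ x)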

Combining continuous and categorical covariates

Armed with this idea of constructing design matrices to give complex models, we can combine any number of categorical and numeric covariates

Combining continuous and categorical covariates

Through the magic of matrix multiplication:

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & 0 & 0 & x_{1,2}\\ 1 & 0 & 0 & x_{2,2}\\ 1 & 1 & 0 & x_{3,2}\\ 1 & 1 & 0 & x_{4,2}\\ 1 & 0 & 1 & x_{5,2}\\ 1 & 0 & 1 & x_{6,2}\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \hat{\beta_3}\\ \end{bmatrix} = \begin{bmatrix} \hat{\beta_0} + \hat{\beta_3}x_{1,2}\\ \hat{\beta_0} + \hat{\beta_3}x_{2,2}\\ \hat{\beta_0} + \hat{\beta_1} + \hat{\beta_3}x_{3,2}\\ \hat{\beta_0} + \hat{\beta_1} + \hat{\beta_3}x_{4,2}\\ \hat{\beta_0} + \hat{\beta_2} + \hat{\beta_3}x_{5,2}\\ \hat{\beta_0} + \hat{\beta_2} + \hat{\beta_3}x_{6,2}\\ \end{bmatrix} \]

So, with an appropriate fitting objective, we now estimate a straight-line relationship between \( x_2 \) and \( y \), with an adjustment for the factor level of \( x_1 \)

Doing one "by hand"

i.e. in R. We contrive some data: 6 points with a 3-level factor variable and a numeric covariate.

  • \( y \) is related to these with Normal noise:
    • mean of \( y \) for level A is 2 (when \( x_2 = 0 \))
    • mean of \( y \) for level B is 10 (when \( x_2 = 0 \))
    • mean of \( y \) for level C is 20 (when \( x_2 = 0 \))
    • \( y \) increases by 2 per unit increase of \( x_2 \)
  set.seed(653465)

  x_1 <- factor(c("A", "A", "B", "B", "C", "C"))
  x_2 <- runif(6, 0, 10)

  y <- rnorm(6, c(2, 2, 10, 10, 20, 20)) + 2*x_2

  plot(x_2, y, pch = 20, col = as.numeric(x_1), cex = 4)

[Figure: y plotted against x_2, points coloured by level of x_1]

Doing one "by hand"

We construct the design matrix i.e. create dummy variables

  x_Design <- cbind(rep(1, 6), rep(c(0,1,0), c(2,2,2)), 
                    rep(c(0,0,1), c(2,2,2)), x_2)

  lmData <- data.frame(y, x_Design)  

  lmData
          y V1 V2 V3      x_2
1 11.491858  1  0  0 5.303632
2  5.528969  1  0  0 1.356557
3 17.623767  1  1  0 4.235338
4 30.908142  1  1  0 9.813859
5 34.429791  1  0  1 7.826814
6 24.410992  1  0  1 2.542402

Doing one "by hand"

We now find the 'best' parameters (same as OLS seen previously)

  exampleLM <- lm(y ~ . -1, data = lmData)

  summary(exampleLM)

Call:
lm(formula = y ~ . - 1, data = lmData)

Residuals:
      1       2       3       4       5       6 
-1.0010  1.0010 -1.0137  1.0137 -0.3224  0.3224 

Coefficients:
    Estimate Std. Error t value Pr(>|t|)   
V1    1.7905     1.3042   1.373  0.30344   
V2    8.3003     1.7070   4.862  0.03979 * 
V3   17.1677     1.5265  11.247  0.00781 **
x_2   2.0179     0.2391   8.439  0.01375 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.461 on 2 degrees of freedom
Multiple R-squared:  0.9987,    Adjusted R-squared:  0.996 
F-statistic: 375.6 on 4 and 2 DF,  p-value: 0.002657
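Since these are ordinary least squares estimates, they can also be computed directly from the design matrix via the normal equations - a sketch of the arithmetic (lm itself uses a more numerically stable QR decomposition):

  X <- as.matrix(x_Design)                   # the 6 x 4 design matrix built above

  betaHat <- solve(t(X) %*% X) %*% t(X) %*% y

  betaHat                                    # matches the lm estimates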

Doing one "by hand"

However, normally we let R do that bit (the dummy coding) automatically within lm. You'll observe the same estimates etc.

  exampleLM <- lm(y ~  x_1 + x_2, data = lmData)

  summary(exampleLM)

Call:
lm(formula = y ~ x_1 + x_2, data = lmData)

Residuals:
      1       2       3       4       5       6 
-1.0010  1.0010 -1.0137  1.0137 -0.3224  0.3224 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   1.7905     1.3042   1.373  0.30344   
x_1B          8.3003     1.7070   4.862  0.03979 * 
x_1C         17.1677     1.5265  11.247  0.00781 **
x_2           2.0179     0.2391   8.439  0.01375 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.461 on 2 degrees of freedom
Multiple R-squared:  0.9932,    Adjusted R-squared:  0.9831 
F-statistic:  97.9 on 3 and 2 DF,  p-value: 0.01013

Doing one "by hand"

What is this saying? Let's just take the estimated coefficients.

  coef(exampleLM)
(Intercept)        x_1B        x_1C         x_2 
   1.790517    8.300291   17.167700    2.017930 
  • For starters, level A isn't listed
  • It is the baseline (i.e. “contained” in the intercept)
  • When x_2 is zero and you are in the baseline class(es), the predicted mean value of y is the “(Intercept)” parameter (here 1.79)

Doing one "by hand"

What is this saying?

  coef(exampleLM)
(Intercept)        x_1B        x_1C         x_2 
   1.790517    8.300291   17.167700    2.017930 
  • As x_2 increases, so does y. On average, 2.02 units per unit increase of x_2. Recall the underlying value was 2.
  • If you're in class A, you're on average 1.79 when x_2 is zero. Recall underlying value was 2.
  • Class B is on average 8.3 units higher than A (regardless of x_2). Recall underlying difference was 8.
  • Class C is on average 17.17 units higher than A (regardless of x_2). Recall underlying difference was 18.
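To make this concrete, a prediction for a hypothetical new point in class B with \( x_2 = 5 \), first assembled from the coefficients and then via predict():

  cf <- coef(exampleLM)

  # by hand: intercept + B adjustment + slope * 5
  cf["(Intercept)"] + cf["x_1B"] + cf["x_2"] * 5

  # the same thing via predict()
  predict(exampleLM, newdata = data.frame(x_1 = "B", x_2 = 5))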

Doing one "by hand"

We've effectively fitted 3 lines to this data:

  fittedCoefs <- coef(exampleLM)

  plot(x_2, y, pch = 20, col = as.numeric(x_1), cex = 4)
  abline(a = fittedCoefs[1], b = fittedCoefs[4])
  abline(a = fittedCoefs[1] + fittedCoefs[2], b = fittedCoefs[4], col = 2)
  abline(a = fittedCoefs[1] + fittedCoefs[3], b = fittedCoefs[4], col = 3)

[Figure: y against x_2 with the three fitted parallel lines]

Clear? Perhaps not - explanation ensues

Winding back to previous work

Now we know a bit more, we can look at our previous models through the prism of a linear model

\(t\)-test revisited

A \( t \)-test run on the weed data

library(tidyverse)

weedData <- read.csv("data/CompletePotPlants.csv", header = T)

CaData <- weedData %>% select(Ca, Group) %>% filter(Group != 'mb' & Group !='bhb')

(turning off the Welch variant for comparability: var.equal = T)

t.test(Ca ~ Group, data = CaData, var.equal = T)

    Two Sample t-test

data:  Ca by Group
t = 6.7205, df = 31, p-value = 1.609e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 16648.81 31156.75
sample estimates:
mean in group nth  mean in group pm 
         54111.11          30208.33 

\(t\)-test revisited

We should be able to fit something similar with a linear model, where the categorical covariate is dummy-coded

  caLM <- lm(Ca ~ Group, data = CaData)

  summary(caLM)

Call:
lm(formula = Ca ~ Group, data = CaData)

Residuals:
   Min     1Q Median     3Q    Max 
-16111  -7208   1792   4792  19792 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    54111       3033   17.84  < 2e-16 ***
Grouppm       -23903       3557   -6.72 1.61e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9100 on 31 degrees of freedom
Multiple R-squared:  0.593, Adjusted R-squared:  0.5799 
F-statistic: 45.16 on 1 and 31 DF,  p-value: 1.609e-07

\(t\)-test revisited

We should be able to fit something similar with a linear model, where the categorical covariate is dummy-coded

  summary(caLM)

Call:
lm(formula = Ca ~ Group, data = CaData)

Residuals:
   Min     1Q Median     3Q    Max 
-16111  -7208   1792   4792  19792 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    54111       3033   17.84  < 2e-16 ***
Grouppm       -23903       3557   -6.72 1.61e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9100 on 31 degrees of freedom
Multiple R-squared:  0.593, Adjusted R-squared:  0.5799 
F-statistic: 45.16 on 1 and 31 DF,  p-value: 1.609e-07

\(t\)-test revisited

  • Observe the intercept is the mean of one group (nth, the baseline)
  • The Grouppm estimate is the difference in means (pm minus nth)
  • The test on this parameter (and the \( p \)-value at the bottom) matches the \( t \)-test
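One quick way to see the correspondence (a sketch using the objects above) is to set the group means alongside the lm coefficients:

  # the two group means, as reported by t.test()
  CaData %>% group_by(Group) %>% summarise(meanCa = mean(Ca))

  # intercept = nth mean; intercept + Grouppm coefficient = pm mean
  coef(caLM)

  sum(coef(caLM))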

ANOVA revisited

Similarly, we might expect to conduct an ANOVA-like thing via lm
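(TiData isn't constructed on these slides; assuming Ti sits alongside Group in weedData, something like this would do:)

  TiData <- weedData %>% select(Ti, Group)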

  p <- ggplot(TiData) + geom_boxplot(aes(Group, Ti), fill = 'purple', alpha = 0.8)

  p

[Figure: boxplots of Ti by Group]

ANOVA revisited

The ANOVA provides:

  weedANOVA <- aov(Ti ~ Group, data = TiData)

  summary(weedANOVA)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Group        2 118.60   59.30   28.31 1.43e-08 ***
Residuals   43  90.06    2.09                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA revisited

  TukeyHSD(weedANOVA)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Ti ~ Group, data = TiData)

$Group
            diff        lwr      upr     p adj
nth-bhb 2.477778  0.9544691 4.001086 0.0008236
pm-bhb  3.750000  2.5402571 4.959743 0.0000000
pm-nth  1.272222 -0.1008697 2.645314 0.0743237

ANOVA revisited

Using lm (which dummy-codes Group):

  weedLM <- lm(Ti ~ Group, data = TiData)

  summary(weedLM)

Call:
lm(formula = Ti ~ Group, data = TiData)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0000 -0.8208 -0.1750  1.0250  4.1500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.1000     0.4014  12.706 3.74e-16 ***
Groupnth      2.4778     0.6275   3.948 0.000287 ***
Grouppm       3.7500     0.4984   7.525 2.26e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.447 on 43 degrees of freedom
Multiple R-squared:  0.5684,    Adjusted R-squared:  0.5483 
F-statistic: 28.31 on 2 and 43 DF,  p-value: 1.427e-08

ANOVA revisited

Using lm (which dummy-codes Group):

  coef(weedLM)
(Intercept)    Groupnth     Grouppm 
   5.100000    2.477778    3.750000 
  • Observe the estimate of the baseline (bhb) mean
  • Estimates of the differences between the baseline and nth and pm (matching the Tukey diffs above)
  • Overall \( F \)-stat, df and \( p \)-values match
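In fact the ANOVA table itself can be recovered directly from the lm fit:

  # same sums of squares, F and p-value as the aov() output earlier
  anova(weedLM)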

ANOVA revisited

A nice thing about framing things as linear models is that our processes for getting estimates, errors, assumption checks etc. use the same tools each time:

  qqnorm(weedLM$residuals)
  qqline(weedLM$residuals)

ANOVA revisited

The residuals have the usual interpretation - estimates of the errors, (observed - fitted), \( y_i - \hat{y_i} \)

[Figure: normal Q-Q plot of weedLM residuals]

ANOVA revisited

So, whether it's a \( t \)-test, ANOVA, regression, whatever, we conduct the same examinations of residuals. They're all linear models.

  shapiro.test(weedLM$residuals)

    Shapiro-Wilk normality test

data:  weedLM$residuals
W = 0.98038, p-value = 0.6216

"ANCOVA"

The term ANalysis of COVAriance (ANCOVA) is sometimes encountered - it is simply a mix of numeric and categorical covariates used to predict a numeric \( y \) (assuming Normal noise):

  • It is, of course, a linear model
  • It may involve an interaction, which we cover later
  • You should be familiar with the term, even though it is just a simple case of a linear model (see the sketch below)
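A minimal sketch, reusing the contrived data from earlier (with only 6 points the interaction model is saturated, so this is purely illustrative):

  # an ANCOVA: one categorical + one numeric covariate (parallel lines)
  lm(y ~ x_1 + x_2, data = lmData)

  # adding an interaction gives each factor level its own slope
  lm(y ~ x_1 * x_2, data = lmData)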

Putting these into practice

So now we'll turn the handle to make more complex models of signal

The model for noise (independent draws from a Normal distribution) remains the same throughout the linear models

Fitting to some data - loan data:

  • BAD - The target variable/response, coded as 1 for a loan default and 0 if paid back.
  • REASON - What the loan was for: HomeImp for home improvement and DebtCon for debt consolidation.
  • JOB - a set of six occupational categories.
  • MORTDUE - the amount remaining on the existing mortgage.
  • VALUE - value of the current property
  • DEBTINC - the debt-to-income ratio i.e. if this is high, then the person owes a lot relative to their income (they're highly indebted).
  • YOJ - the number of years at the present job.
  • DEROG - the number of major derogatory reports
  • CLNO - the number of trade lines
  • DELINQ - the number of delinquent trade lines
  • CLAGE - the age of the oldest trade line (months)
  • NINQ - the number of recent credit checks.

Fitting to some data

We'll see if we can predict debt-to-income ratio (DEBTINC) on the basis of REASON (categorical) and VALUE (numeric)

  loanData <- read_csv("data/hmeq.csv")

  head(loanData)
# A tibble: 6 x 13
    BAD  LOAN MORTDUE  VALUE REASON JOB     YOJ DEROG DELINQ CLAGE  NINQ
  <dbl> <dbl>   <dbl>  <dbl> <chr>  <chr> <dbl> <dbl>  <dbl> <dbl> <dbl>
1     1  1100   25860  39025 HomeI~ Other  10.5     0      0  94.4     1
2     1  1300   70053  68400 HomeI~ Other   7       0      2 122.      0
3     1  1500   13500  16700 HomeI~ Other   4       0      0 149.      1
4     1  1500      NA     NA <NA>   <NA>   NA      NA     NA  NA      NA
5     0  1700   97800 112000 HomeI~ Offi~   3       0      0  93.3     0
6     1  1700   30548  40320 HomeI~ Other   9       0      0 101.      1
# ... with 2 more variables: CLNO <dbl>, DEBTINC <dbl>

Fitting to some data

Specify the linear model: \( y = f(Reason, Value) \), where the function/signal is of the linear form \( \mathbf{y} = \mathbf{X}\boldsymbol{\beta} \) and our model for noise is independent draws from a Normal distribution.

  loanData <- loanData %>% select(DEBTINC, REASON, VALUE) %>% na.omit()

  loanLM <- lm(DEBTINC ~ REASON + VALUE, data = loanData)  

  summary(loanLM)

Call:
lm(formula = DEBTINC ~ REASON + VALUE, data = loanData)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.815  -4.654   0.890   4.988 168.448 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.232e+01  2.636e-01 122.619  < 2e-16 ***
REASONHomeImp -1.002e+00  2.646e-01  -3.786 0.000155 ***
VALUE          1.915e-05  2.132e-06   8.984  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.145 on 4473 degrees of freedom
Multiple R-squared:  0.02057,   Adjusted R-squared:  0.02013 
F-statistic: 46.98 on 2 and 4473 DF,  p-value: < 2.2e-16

The number of loans of each REASON (e.g. via table(loanData$REASON)):

DebtCon HomeImp 
   3114    1362 

Interpretation

  • All estimates are statistically significant
  • The intercept parameter test is not interesting (why?)
  • The negative REASONHomeImp estimate says debt consolidation loans are associated with higher debt-to-income ratios (about 1 unit higher, in fact)
  • As VALUE increases by 1 unit, average DEBTINC is predicted to increase by 1.9e-5 units

Interpretation

  • The model is highly significant, but little variance is explained (it's not a great predictor apparently)
  • We can get confidence intervals about the estimates, which could be further interpreted.
  confint(loanLM)
                      2.5 %        97.5 %
(Intercept)    3.180781e+01  3.284145e+01
REASONHomeImp -1.520570e+00 -4.829794e-01
VALUE          1.497405e-05  2.333322e-05
  • Note 0 is not contained/plausible, in keeping with the highly significant test results.
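These carry through to predictions as well; for example, a hypothetical debt-consolidation loan against a property valued at 100000:

  predict(loanLM, newdata = data.frame(REASON = "DebtCon", VALUE = 100000),
          interval = "confidence")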

Partially check the noise model suits

They're linear models - the same simple model for noise applies, and we can check some aspects of its shape like so:

  shapiro.test(resid(loanLM))

    Shapiro-Wilk normality test

data:  resid(loanLM)
W = 0.8326, p-value < 2.2e-16
  qqnorm(residuals(loanLM))

[Figure: normal Q-Q plot of loanLM residuals]

Partially check the noise model suits

We can actually get a bunch of plots by plotting the model object

  plot(loanLM)

[Figure: standard lm diagnostic plots for loanLM]

Partially check the noise model suits

Or get specific ones by actually working:

  plot(loanData$VALUE, resid(loanLM))

[Figure: loanLM residuals plotted against VALUE]

Recap and look-forwards

We've covered:

  • The basic linear model structure for the signal
  • Alluded to the assumed noise model
  • Dummy-coded to include categorical covariates
  • Seen a little of the wide range of signal afforded
  • Fitted, interpreted and partially checked models

Next:

  • Maximum likelihood in a bit more detail
  • Somewhat conceptually, how this finds our parameter estimates