C. Donovan
Here we look to unify several of the models seen previously (which generated CIs and \( p \)-values)
The class of models is Linear Models
They have a specific structure for the signal and noise
My favourite equation
Many models are effectively regressions - they are broadly of the form:
\[ y = f(x_1, \ldots, x_p) + \text{noise} \]
So we need to estimate the signal \( f \), which is usually contingent on an idea of the distribution of the error (because we have to tease the signal and noise apart)
As we'll see, our estimates for the signal are optimised in light of our model for noise
The term linear can be used differently in different contexts - here it means the signal is linear in the parameters \( \boldsymbol{\beta} \):
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} \]
i.e. observation-by-observation
\[ y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_p x_{p,i} \]
This is very wide. Naturally it includes a straight line relating \( x \) and \( y \):
\[ y = \beta_0 + \beta_1 x_1 \]
or indeed just modelling \( y \) as a constant (a mean would be best)
\[ y = \beta_0 \]
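As a minimal sketch in R (the y values below are invented purely for illustration), fitting the constant-only model with lm just recovers the sample mean:
y <- c(3.1, 4.2, 5.0, 3.8)   # made-up response values
coef(lm(y ~ 1))              # the estimated beta_0 (the intercept)
mean(y)                      # the same number: the sample mean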
Let's look at some examples
A basic regression type thing
Which fits the linear model for signal
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \end{bmatrix} \]
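A sketch of this design matrix in R (the x values are invented): model.matrix builds exactly this intercept-plus-covariate structure, which is what lm uses internally.
x <- c(1.2, 3.4, 5.6, 7.8)   # invented covariate values
model.matrix(~ x)            # a column of 1s and a column of x values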
A cubic regression thing:
Which fits the linear model for signal
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3\\ 1 & x_2 & x_2^2 & x_2^3\\ \vdots & \vdots & \vdots & \vdots\\ 1 & x_n & x_n^2 & x_n^3\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \hat{\beta_3}\\ \end{bmatrix} \]
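In R the cubic is the same kind of lm call, just with extra columns in the design matrix; a sketch on invented data, using I() for the squared and cubed terms:
x <- runif(20, 0, 2)                      # invented covariate
y <- 1 + x - 2*x^2 + 0.5*x^3 + rnorm(20)  # invented response with a cubic signal
cubicLM <- lm(y ~ x + I(x^2) + I(x^3))    # design matrix columns: 1, x, x^2, x^3
coef(cubicLM)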
A plane (could be hyper-plane with more \( x \))
Which fits the linear model for signal
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & x_{1,1} & x_{2,1}\\ 1 & x_{1,2} & x_{2,2}\\ \vdots & \vdots & \vdots\\ 1 & x_{1,n} & x_{2,n}\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \end{bmatrix} \]
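Similarly, a (hyper-)plane is just lm with several numeric covariates; a sketch on invented data:
x1 <- runif(20); x2 <- runif(20)          # invented covariates
y <- 2 + 3*x1 - x2 + rnorm(20, sd = 0.2)  # invented response
planeLM <- lm(y ~ x1 + x2)                # design matrix columns: 1, x1, x2
head(model.matrix(planeLM))               # inspect the design matrix used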
We frequently need to put categorical variables into an equation - demanding some numeric representation
We do this using dummy variable encoding/representation (comp. sci. focussed people use the ridiculous term one-hot encoding)
We cover this here - linear models encompass this
We just need to get clever with the design matrix i.e. the matrix \( \mathbf{X} \) on the RHS of the regression equation that expresses our covariates
So - say \( X \) has 3 levels, with the first 6 values being \( \mathbf{x} = [A, A, B, B, C, C] \), we might encode:
\[ \mathbf{x} \rightarrow \begin{bmatrix} 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 0 & 0 & 1\\ \end{bmatrix} \]
A “1” in the first column means level A, a “1” in the second column means level B and a “1” in the third column means level C
Through the magic of matrix multiplication:
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 0 & 0 & 1\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \end{bmatrix} = \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \hat{\beta_2}\\ \end{bmatrix} \]
So, with an appropriate fitting objective, the estimated \( \hat{\beta_j} \) could be means of each factor level (A, B, C).
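We don't usually build this by hand; a sketch showing that model.matrix with the intercept removed produces exactly this one-indicator-per-level matrix:
x <- factor(c("A", "A", "B", "B", "C", "C"))
model.matrix(~ x - 1)   # one column of 0/1s per level, no baseline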
Usually however, we make one level the baseline, so some parameters are estimated differences between the baseline and other factor levels i.e. the bit you add to go from the baseline to another level.
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & 0 & 0\\ 1 & 0 & 0\\ 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 0 & 1\\ 1 & 0 & 1\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \end{bmatrix} = \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_0}\\ \hat{\beta_0} + \hat{\beta_1}\\ \hat{\beta_0} + \hat{\beta_1}\\ \hat{\beta_0} + \hat{\beta_2}\\ \hat{\beta_0} + \hat{\beta_2}\\ \end{bmatrix} \]
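This baseline (treatment-contrast) coding is R's default for factors; a sketch:
x <- factor(c("A", "A", "B", "B", "C", "C"))
model.matrix(~ x)   # intercept column, then indicators for B and C (A is the baseline)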
Armed with this idea of constructing design matrices to give complex models, we can combine any number of categorical and numeric covariates
Through the magic of matrix multiplication:
\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat{\beta}} = \begin{bmatrix} 1 & 0 & 0 & x_{1,2}\\ 1 & 0 & 0 & x_{2,2}\\ 1 & 1 & 0 & x_{3,2}\\ 1 & 1 & 0 & x_{4,2}\\ 1 & 0 & 1 & x_{5,2}\\ 1 & 0 & 1 & x_{6,2}\\ \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \hat{\beta_3}\\ \end{bmatrix} = \begin{bmatrix} \hat{\beta_0} + \hat{\beta_3}x_{1,2}\\ \hat{\beta_0} + \hat{\beta_3}x_{2,2}\\ \hat{\beta_0} + \hat{\beta_1} + \hat{\beta_3}x_{3,2}\\ \hat{\beta_0} + \hat{\beta_1} + \hat{\beta_3}x_{4,2}\\ \hat{\beta_0} + \hat{\beta_2} + \hat{\beta_3}x_{5,2}\\ \hat{\beta_0} + \hat{\beta_2} + \hat{\beta_3}x_{6,2}\\ \end{bmatrix} \]
So, with an appropriate fitting objective, we now estimate a straight-line relationship between \( x_2 \) and \( y \), with an adjustment for the factor level of \( x_1 \)
i.e. in R, contrive some data - 6 pts with a 3-level factor variable x_1 and a numeric x_2, where the underlying mean of y at x_2 = 0 is 2 for level A, 10 for level B and 20 for level C.
set.seed(653465)
x_1 <- factor(c("A", "A", "B", "B", "C", "C"))
x_2 <- runif(6, 0, 10)
y <- rnorm(6, c(2, 2, 10, 10, 20, 20)) + 2*x_2
plot(x_2, y, pch = 20, col = as.numeric(x_1), cex = 4)
We construct the design matrix i.e. create dummy variables
x_Design <- cbind(rep(1, 6), rep(c(0,1,0), c(2,2,2)),   # intercept column, dummy for level B
                  rep(c(0,0,1), c(2,2,2)), x_2)         # dummy for level C, then x_2
lmData <- data.frame(y, x_Design)
lmData
y V1 V2 V3 x_2
1 11.491858 1 0 0 5.303632
2 5.528969 1 0 0 1.356557
3 17.623767 1 1 0 4.235338
4 30.908142 1 1 0 9.813859
5 34.429791 1 0 1 7.826814
6 24.410992 1 0 1 2.542402
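As a cross-check (a sketch), we could let R build the design matrix from the factor directly - it reproduces x_Design up to column names:
model.matrix(~ x_1 + x_2)   # intercept, indicators for levels B and C, then x_2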
We now find the 'best' parameters (same as OLS seen previously)
exampleLM <- lm(y ~ . -1, data = lmData)
summary(exampleLM)
Call:
lm(formula = y ~ . - 1, data = lmData)
Residuals:
1 2 3 4 5 6
-1.0010 1.0010 -1.0137 1.0137 -0.3224 0.3224
Coefficients:
Estimate Std. Error t value Pr(>|t|)
V1 1.7905 1.3042 1.373 0.30344
V2 8.3003 1.7070 4.862 0.03979 *
V3 17.1677 1.5265 11.247 0.00781 **
x_2 2.0179 0.2391 8.439 0.01375 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.461 on 2 degrees of freedom
Multiple R-squared: 0.9987, Adjusted R-squared: 0.996
F-statistic: 375.6 on 4 and 2 DF, p-value: 0.002657
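As an aside, a sketch of what that fitting objective does: the least-squares estimates can be computed directly from the design matrix, and they match the lm output above.
X <- as.matrix(x_Design)        # the design matrix built earlier
solve(t(X) %*% X, t(X) %*% y)   # OLS solution (X'X)^{-1} X'y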
However, normally we let R do that bit automatically in lm. You'll observe the same estimates etc.
exampleLM <- lm(y ~ x_1 + x_2, data = lmData)
summary(exampleLM)
Call:
lm(formula = y ~ x_1 + x_2, data = lmData)
Residuals:
1 2 3 4 5 6
-1.0010 1.0010 -1.0137 1.0137 -0.3224 0.3224
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7905 1.3042 1.373 0.30344
x_1B 8.3003 1.7070 4.862 0.03979 *
x_1C 17.1677 1.5265 11.247 0.00781 **
x_2 2.0179 0.2391 8.439 0.01375 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.461 on 2 degrees of freedom
Multiple R-squared: 0.9932, Adjusted R-squared: 0.9831
F-statistic: 97.9 on 3 and 2 DF, p-value: 0.01013
What is this saying? Let's just take the estimated coefficients.
coef(exampleLM)
(Intercept) x_1B x_1C x_2
1.790517 8.300291 17.167700 2.017930
If x_2 is zero and you are in the baseline class (here level A), the predicted mean value of y is the “(Intercept)” parameter (here 1.79). Recall the underlying value was 2.
As x_2 increases, so does y - on average, by 2.02 units per unit increase of x_2. Recall the underlying value was 2.
The x_1B estimate (8.30) is the estimated difference between levels A and B at any given x_2. Recall the underlying difference was 8.
The x_1C estimate (17.17) is the estimated difference between levels A and C at any given x_2. Recall the underlying difference was 18.
We've effectively fitted 3 lines to this data:
fittedCoefs <- coef(exampleLM)
plot(x_2, y, pch = 20, col = as.numeric(x_1), cex = 4)
abline(a = fittedCoefs[1], b = fittedCoefs[4])
abline(a = fittedCoefs[1] + fittedCoefs[2], b = fittedCoefs[4], col = 2)
abline(a = fittedCoefs[1] + fittedCoefs[3], b = fittedCoefs[4], col = 3)
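Equivalently, a sketch using predict rather than assembling intercepts and slopes by hand (the new x_2 values of 0 and 10 are arbitrary choices):
newX <- data.frame(x_1 = factor(rep(c("A", "B", "C"), each = 2)),
                   x_2 = rep(c(0, 10), times = 3))         # two x_2 values per factor level
cbind(newX, fitted = predict(exampleLM, newdata = newX))   # points on each of the 3 fitted lines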
Clear? Perhaps not - explanation ensues
Now we know a bit more, we can look at our previous models through the prism of a linear model
A \( t \)-test run on the weed data
library(tidyverse)
weedData <- read.csv("data/CompletePotPlants.csv", header = T)
CaData <- weedData %>% select(Ca, Group) %>% filter(Group != 'mb' & Group !='bhb')
(turning off the Welch variant for comparability: var.equal = T)
t.test(Ca ~ Group, data = CaData, var.equal = T)
Two Sample t-test
data: Ca by Group
t = 6.7205, df = 31, p-value = 1.609e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
16648.81 31156.75
sample estimates:
mean in group nth mean in group pm
54111.11 30208.33
We should be able to fit something similar with a linear model, where the categorical covariate is dummy-coded
caLM <- lm(Ca ~ Group, data = CaData)
summary(caLM)
Call:
lm(formula = Ca ~ Group, data = CaData)
Residuals:
Min 1Q Median 3Q Max
-16111 -7208 1792 4792 19792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54111 3033 17.84 < 2e-16 ***
Grouppm -23903 3557 -6.72 1.61e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9100 on 31 degrees of freedom
Multiple R-squared: 0.593, Adjusted R-squared: 0.5799
F-statistic: 45.16 on 1 and 31 DF, p-value: 1.609e-07
The (Intercept) estimate is the mean of the nth group (54111) and the Grouppm estimate is the estimated difference between the group means (-23903, i.e. 30208 - 54111), matching the t-test output above.
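A quick sketch of this correspondence, comparing the group means with the lm coefficients (the ordering of the levels determines the sign):
groupMeans <- tapply(CaData$Ca, CaData$Group, mean)   # means for nth and pm
groupMeans
diff(groupMeans)   # pm minus nth: matches the Grouppm estimate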
Similarly, we might expect to conduct an ANOVA-like thing via lm
TiData <- weedData %>% select(Ti, Group)   # assumed setup: keep Ti and Group (all three groups) for this example
p <- ggplot(TiData) + geom_boxplot(aes(Group, Ti), fill = 'purple', alpha = 0.8)
p
The ANOVA provides:
weedANOVA <- aov(Ti ~ Group, data = TiData)
summary(weedANOVA)
Df Sum Sq Mean Sq F value Pr(>F)
Group 2 118.60 59.30 28.31 1.43e-08 ***
Residuals 43 90.06 2.09
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(weedANOVA)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Ti ~ Group, data = TiData)
$Group
diff lwr upr p adj
nth-bhb 2.477778 0.9544691 4.001086 0.0008236
pm-bhb 3.750000 2.5402571 4.959743 0.0000000
pm-nth 1.272222 -0.1008697 2.645314 0.0743237
Using lm (which dummy-codes Group):
weedLM <- lm(Ti ~ Group, data = TiData)
summary(weedLM)
Call:
lm(formula = Ti ~ Group, data = TiData)
Residuals:
Min 1Q Median 3Q Max
-3.0000 -0.8208 -0.1750 1.0250 4.1500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1000 0.4014 12.706 3.74e-16 ***
Groupnth 2.4778 0.6275 3.948 0.000287 ***
Grouppm 3.7500 0.4984 7.525 2.26e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.447 on 43 degrees of freedom
Multiple R-squared: 0.5684, Adjusted R-squared: 0.5483
F-statistic: 28.31 on 2 and 43 DF, p-value: 1.427e-08
Using lm (which dummy-codes Group):
coef(weedLM)
(Intercept) Groupnth Grouppm
5.100000 2.477778 3.750000
The (Intercept) is the mean of the baseline group (bhb, 5.10); Groupnth and Grouppm are the estimated differences from bhb for nth and pm - compare the Tukey nth-bhb and pm-bhb differences above (2.48 and 3.75).
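A sketch of the same correspondence for the three groups:
groupMeans <- tapply(TiData$Ti, TiData$Group, mean)   # means for bhb, nth, pm
groupMeans
groupMeans - groupMeans["bhb"]   # differences from the baseline: match Groupnth and Grouppm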
A nice thing about doing things as linear models is that our processes for getting estimates, errors, assumption checking etc. use the same tools each time:
qqnorm(weedLM$residuals)
qqline(weedLM$residuals)
Residuals have the usual interpretation - an estimate of the errors, (observed - expected), \( y_i - \hat{y}_i \)
So, \( t \)-test, ANOVA, regression, whatever, will conduct examinations of residuals. They're all linear models.
shapiro.test(weedLM$residuals)
Shapiro-Wilk normality test
data: weedLM$residuals
W = 0.98038, p-value = 0.6216
You'll sometimes see reference to an ANalysis of COVAriance (ANCOVA) - this is simply a mix of numeric and categorical covariates for predicting a numeric \( y \) (assuming Normal noise).
So now we'll turn the handle to make more complex models of signal
The model for noise (independent draws from a Normal distribution) remains the same throughout the linear models
We'll see if we can predict debt-to-income ratio (DEBTINC) on the basis of REASON (categorical) and VALUE (numeric)
loanData <- read_csv("data/hmeq.csv")
head(loanData)
# A tibble: 6 x 13
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1100 25860 39025 HomeI~ Other 10.5 0 0 94.4 1
2 1 1300 70053 68400 HomeI~ Other 7 0 2 122. 0
3 1 1500 13500 16700 HomeI~ Other 4 0 0 149. 1
4 1 1500 NA NA <NA> <NA> NA NA NA NA NA
5 0 1700 97800 112000 HomeI~ Offi~ 3 0 0 93.3 0
6 1 1700 30548 40320 HomeI~ Other 9 0 0 101. 1
# ... with 2 more variables: CLNO <dbl>, DEBTINC <dbl>
Specify the linear model: \( y = f(Reason, Value) \), where the function/signal is of the linear form \( \mathbf{y} = \mathbf{X}\boldsymbol{\beta} \) and our model for noise is independent draws from a Normal distribution.
loanData <- loanData %>% select(DEBTINC, REASON, VALUE) %>% na.omit()
loanLM <- lm(DEBTINC ~ REASON + VALUE, data = loanData)
summary(loanLM)
Call:
lm(formula = DEBTINC ~ REASON + VALUE, data = loanData)
Residuals:
Min 1Q Median 3Q Max
-33.815 -4.654 0.890 4.988 168.448
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.232e+01 2.636e-01 122.619 < 2e-16 ***
REASONHomeImp -1.002e+00 2.646e-01 -3.786 0.000155 ***
VALUE 1.915e-05 2.132e-06 8.984 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.145 on 4473 degrees of freedom
Multiple R-squared: 0.02057, Adjusted R-squared: 0.02013
F-statistic: 46.98 on 2 and 4473 DF, p-value: < 2.2e-16
For reference, the number of observations at each REASON level (e.g. table(loanData$REASON)):
DebtCon HomeImp
3114 1362
The negative REASONHomeImp coefficient says Debt Consolidation loans are associated with higher debt-to-income ratios (about 1 unit higher, in fact). As VALUE increases by 1 unit, the average DEBTINC is predicted to increase by 1.9e-5 units.
confint(loanLM)
2.5 % 97.5 %
(Intercept) 3.180781e+01 3.284145e+01
REASONHomeImp -1.520570e+00 -4.829794e-01
VALUE 1.497405e-05 2.333322e-05
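Those intervals carry over to predictions too; a sketch (the REASON level and VALUE chosen here are arbitrary):
newLoan <- data.frame(REASON = "HomeImp", VALUE = 100000)     # arbitrary illustrative covariate values
predict(loanLM, newdata = newLoan, interval = "confidence")   # fitted mean DEBTINC with a 95% CI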
They're linear models, so the same simple model for noise applies - we can check some shape aspects like so:
shapiro.test(resid(loanLM))
Shapiro-Wilk normality test
data: resid(loanLM)
W = 0.8326, p-value < 2.2e-16
qqnorm(residuals(loanLM))
We can actually get a bunch of plots by plotting the model object
plot(loanLM)
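For example, a small sketch to see the standard four diagnostic plots on one page:
par(mfrow = c(2, 2))   # 2 x 2 grid of plots
plot(loanLM)           # residuals vs fitted, Normal QQ, scale-location, residuals vs leverage
par(mfrow = c(1, 1))   # reset the plotting layout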
Or get specific ones by actually working:
plot(loanData$VALUE, resid(loanLM))
We've covered: the linear model \( \mathbf{y} = \mathbf{X}\boldsymbol{\beta} \) plus Normal noise, dummy coding of categorical covariates via the design matrix, and how \( t \)-tests, ANOVA, regression and ANCOVA are all linear models fitted with lm.
Next: