Intro to Linear Models

Juan Steibel

Basic Linear model

\[ \color{blue}{y_i} = \sum_{j=1}^{p} {\color{red}{x_{ij}} \color{darkgreen}{\beta_j}}+\color{orange}{e_i} \]

\[ E(\color{orange}{e_i}) = 0 \qquad(1)\]

\(\color{blue}{y_i}\): observation of the response variable, \(\color{red}{x_{ij}}\): observation of a predictor variable, \(\color{darkgreen}{\beta_j}\): linear coefficient, \(\color{orange}{e_i}\): residual.

In matrix form:

\[ \boldsymbol{\color{blue}y}=\boldsymbol{\color{red}X}\boldsymbol{\color{darkgreen}\beta}+\boldsymbol{\color{orange}e} \qquad(2)\]

\(rank(\boldsymbol{\color{red}X}) = p\): the design matrix is full column rank. Class question.

Example 1: Longley data (first 6 observations shown)

     GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
1952         98.1 346.999      193.2        359.4    113.270 1952   63.639

\[ Employed_i=\alpha+GNP_i \beta+ e_i, \boldsymbol{y}=\begin{bmatrix}60.323 \\61.122 \\60.171 \\61.187 \\63.221 \\63.639 \\\end{bmatrix}, \boldsymbol{X} =\begin{bmatrix}1&234.289 \\1&259.426 \\1&258.054 \\1&284.599 \\1&328.975 \\1&346.999 \\\end{bmatrix} \]
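As a minimal R sketch (my own illustration using the built-in longley data; the object names y and X are assumptions), the response vector and design matrix above can be built as follows:

    # Build y and X for Example 1 from R's built-in longley data
    data(longley)
    y <- longley$Employed                     # response vector
    X <- model.matrix(~ GNP, data = longley)  # design matrix: intercept column + GNP
    head(X)                                   # first 6 rows match the design matrix shown above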

Model assumptions and choosing \(\beta\)

\(E(\boldsymbol{e}) = 0\); we make no further distributional assumptions. Then: \(E(\boldsymbol{\color{blue}y})=\boldsymbol{\color{red}X}\boldsymbol{\color{darkgreen}\beta}\)

What is the best fitting line?

Model assumptions and choosing \(\beta\)


All of these candidate regression lines produce residuals with expectation 0.

We need another criterion to estimate \(\beta\).

Ordinary Least Squares

Find the parameter values that minimize the sum of squared errors: \(SSE=\sum_{i=1}^{n} \hat{e}_i^2 = \hat{\boldsymbol{e}}^{'}\hat{\boldsymbol{e}}=(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})^{'}(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})\)

This means we take the derivative of SSE with respect to \(\hat{\boldsymbol{\beta}}\) and set it to zero (a brief sketch follows).
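For reference, the omitted step is the standard one: differentiating the quadratic form and setting the gradient to zero gives the normal equations,

\[ \frac{\partial SSE}{\partial \hat{\boldsymbol{\beta}}} = -2\,\boldsymbol{X}^{'}\boldsymbol{y} + 2\,\boldsymbol{X}^{'}\boldsymbol{X}\hat{\boldsymbol{\beta}} = \boldsymbol{0} \quad\Rightarrow\quad \boldsymbol{X}^{'}\boldsymbol{X}\hat{\boldsymbol{\beta}} = \boldsymbol{X}^{'}\boldsymbol{y} \]

whose solution, when \(\boldsymbol{X}\) is full column rank, is the OLS estimator given below.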

Ordinary Least Squares


The minimizer of SSE is called the ordinary least squares (OLS) estimate of \(\boldsymbol{\beta}\), and it is computed as follows:

\[\hat{\boldsymbol{\beta}} =(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} \boldsymbol{X^{'}} \boldsymbol{y} \] Thus, the OLS fitted values can be written:

\[ \hat{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\beta}} = \boldsymbol{X}(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} \boldsymbol{X^{'}} \boldsymbol{y} = \boldsymbol{P_{X}}\boldsymbol{y} \]

where \(\boldsymbol{P_X}\) is the projection matrix (it projects the response onto the column space of the predictors in \(\boldsymbol{X}\)).
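Continuing the R sketch from Example 1 (y and X as defined earlier), the OLS estimate and the projection matrix can be computed directly; this is for illustration only, since in practice lm() uses a more stable QR decomposition:

    # OLS via the normal equations (continuing the earlier sketch)
    XtX      <- crossprod(X)                  # X'X
    beta_hat <- solve(XtX, crossprod(X, y))   # (X'X)^{-1} X'y
    P_X      <- X %*% solve(XtX) %*% t(X)     # projection ("hat") matrix
    y_hat    <- P_X %*% y                     # fitted values, same as X %*% beta_hat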

Example 1:

\[ \boldsymbol{X^{'}} \boldsymbol{X} = \begin{bmatrix} 16& 6203 \\ 6203&2553152 \\\end{bmatrix}, (\boldsymbol{X^{'}} \boldsymbol{X})^{-1} = \begin{bmatrix} 1.1e+00&-2.6e-03 \\-2.6e-03& 6.7e-06 \\\end{bmatrix} \] \[ \boldsymbol{X}^{'}\boldsymbol{y}=\begin{bmatrix} 1045 \\410323 \\\end{bmatrix}, \boldsymbol{\hat{\beta}}=\begin{bmatrix}51.844 \\ 0.035 \\\end{bmatrix} \] The predicted values are \(\boldsymbol{\hat{y}}=\boldsymbol{X}\boldsymbol{\hat{\beta}}\) (first 10 of 16 shown):

\[ \begin{bmatrix}60&60.9&60.8&61.7&63.3&63.9&64.5&64.5&65.7&66.4 \\\end{bmatrix} \] Observed y:

\[ \begin{bmatrix}60.3&61.1&60.2&61.2&63.2&63.6&65&63.8&66&67.9 \\\end{bmatrix} \]

Properties of the projection matrix


\(\boldsymbol{P_X}\) is an important matrix in mixed models. It is the projection matrix, and we will encounter other projection matrices in this course. Its key properties are listed below (a numerical check in R follows the list).

  • Symmetric: \(\boldsymbol{P_X} = \boldsymbol{P_X^{'}}\)

  • Idempotent: \(\boldsymbol{P_X} = \boldsymbol{P_X} \boldsymbol{P_X}\)

  • \(\boldsymbol{P_X} \boldsymbol{X} = \boldsymbol{X}\)

  • \(rank(\boldsymbol{P_X}) = rank(\boldsymbol{X}) = tr(\boldsymbol{P_X}) = p\)

  • \(\boldsymbol{\hat{e}}=( \boldsymbol{I}-\boldsymbol{P_X})\boldsymbol{y}\)

  • \(( \boldsymbol{I}-\boldsymbol{P_X})( \boldsymbol{I}-\boldsymbol{P_X})=( \boldsymbol{I}-\boldsymbol{P_X})\)
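A quick numerical check of these properties, continuing the earlier R sketch (X, y, and P_X as defined above):

    max(abs(P_X - t(P_X)))                  # ~0: symmetric
    max(abs(P_X %*% P_X - P_X))             # ~0: idempotent
    max(abs(P_X %*% X - X))                 # ~0: P_X X = X
    sum(diag(P_X))                          # trace = rank(X) = p = 2
    e_hat <- (diag(nrow(X)) - P_X) %*% y    # residuals via (I - P_X) y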

Unbiased estimator of the variance

An unbiased estimator of the error variance, \(var(e_i)=\sigma^2_e\), is: \[ \hat{\sigma}_e^2=\frac{\boldsymbol{y}^{'}( \boldsymbol{I}-\boldsymbol{P_X} )\boldsymbol{y}} {n-p} \] Also, note that: \[ SSE=\boldsymbol{y}^{'}( \boldsymbol{I}-\boldsymbol{P_X} )\boldsymbol{y} \] The proof requires a bit of matrix algebra; you are welcome to try it and ask me questions.
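These quantities are easy to compute in the running R sketch (X, y, P_X as before); the results should match the example that follows:

    # SSE and the unbiased variance estimate (continuing the sketch)
    n <- nrow(X); p <- ncol(X)
    SSE        <- drop(t(y) %*% (diag(n) - P_X) %*% y)   # y'(I - P_X) y
    sigma2_hat <- SSE / (n - p)                          # unbiased estimate of sigma_e^2
    sqrt(sigma2_hat)                                     # residual standard error, ~0.66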

Example 1:

\[ \hat{\sigma}_e^2= 0.43, \hat{\sigma}_e= 0.66, \boldsymbol{\hat{\beta}}=\begin{bmatrix}51.844 \\ 0.035 \\\end{bmatrix} \]
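These results agree with R's lm(); the call below (a minimal sketch) produces the summary output shown next:

    # Fit the same model with lm(); its summary output appears below
    fit <- lm(Employed ~ GNP, data = longley)
    summary(fit)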


Call:
lm(formula = Employed ~ GNP, data = longley)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7796 -0.5544 -0.0094  0.3436  1.4459 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 51.84359    0.68137    76.1  < 2e-16 ***
GNP          0.03475    0.00171    20.4  8.4e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.66 on 14 degrees of freedom
Multiple R-squared:  0.967, Adjusted R-squared:  0.965 
F-statistic:  415 on 1 and 14 DF,  p-value: 8.36e-12

Variance of (linear combinations of) estimates of linear parameters

The variance of \(\hat{\boldsymbol{\beta}}\) is: \[ var(\hat{\boldsymbol{\beta}})=(\boldsymbol{X^{'}X})^{-1}{\sigma}_e^2\\ var(\boldsymbol{K}\hat{\boldsymbol{\beta}})=\boldsymbol{K}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{K}^{'}{\sigma}_e^2 \] where \(\boldsymbol{K}\) is a contrast matrix.
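A small R illustration of these formulas, continuing the sketch (fit as defined earlier; the GNP value 250 and the object name K are my own choices). For an lm fit, vcov() returns the estimated \((\boldsymbol{X^{'}X})^{-1}\hat{\sigma}_e^2\):

    vcov(fit)                          # estimated (X'X)^{-1} * sigma_e^2
    K <- matrix(c(1, 250), nrow = 1)   # example K: expected Employed when GNP = 250
    K %*% coef(fit)                    # K beta_hat
    K %*% vcov(fit) %*% t(K)           # estimated var(K beta_hat)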

Challenge for the class: select \(\boldsymbol{K}\) so that \(\boldsymbol{K}\hat{\boldsymbol{\beta}}\) is just the \(i^{th}\) element of \(\hat{\boldsymbol{\beta}}\)

Another challenge: in practice we don’t know \(\sigma_e^2\); what do we do then?

Assuming normality

So far we have not made any parametric assumptions. But we can assume that the residuals are independent and identically distributed Gaussian random variables:

\[ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{e}\\ \boldsymbol{e} \sim N(\boldsymbol{0},\boldsymbol{I}{\sigma}_e^2) \] Under these assumptions, we can obtain the maximum likelihood estimates of the parameters by maximizing the log-likelihood (the logarithm of the probability density function of the data, viewed as a function of the parameters).
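As a small illustration, continuing the R sketch (loglik_normal is a hypothetical helper written here, not from the slides), the Gaussian log-likelihood can be coded and compared with R's logLik():

    # Gaussian log-likelihood for the linear model (illustrative helper)
    loglik_normal <- function(beta, sigma2, y, X) {
      r <- y - X %*% beta
      -0.5 * (length(y) * log(2 * pi * sigma2) + sum(r^2) / sigma2)
    }
    # Evaluated at the ML estimates (the ML estimate of sigma_e^2 divides SSE by n)
    loglik_normal(coef(fit), SSE / n, y, X)
    logLik(fit)   # should agree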

Estimates of parameters under the normal linear model

\[ \boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{e}, \boldsymbol{e} \sim N(\boldsymbol{0},\boldsymbol{I}{\sigma}_e^2), \] \[ \hat{\boldsymbol{\beta}} =(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} \boldsymbol{X^{'}} \boldsymbol{y}, \] \[ \hat{\boldsymbol{\beta}} \sim N({\boldsymbol{\beta}},(\boldsymbol{X^{'}X})^{-1}{\sigma}_e^2) \] \[ \hat{\sigma}_e^2=\frac{\boldsymbol{y}^{'}( \boldsymbol{I}-\boldsymbol{P_X} )\boldsymbol{y}} {n-p} \] These estimates, variances and expectations coincide with OLS for this model

Inference with unknown variances

\[ \hat{\boldsymbol{\beta}} \sim N({\boldsymbol{\beta}},(\boldsymbol{X^{'}X})^{-1}{\sigma}_e^2) \] For a single coefficient: \[ \hat{\beta_j} \sim N(\beta_j,d_j \sigma_e^2), \]

where \(d_j\) is the \(j^{th}\) diagonal element of \((\boldsymbol{X^{'}X})^{-1}\). With unknown residual variance: \[ \hat{\beta_j} \sim t(\beta_j,\, d_j\hat{\sigma}_e^2,\, n-p) \]

This is used to obtain confidence intervals and hypothesis tests for the regression coefficients.
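In R, these t-based intervals and tests come directly from the fitted model (fit from the earlier sketch):

    confint(fit, level = 0.95)     # t-based confidence intervals for the coefficients
    summary(fit)$coefficients      # estimates, standard errors, t values, p-values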

Estimation of expected values under the Gaussian GLM

Given a new design matrix \(\boldsymbol{X}_{new}\), the estimated expected value is: \[ \hat{E(\boldsymbol{y}_{new})}=\bar{\boldsymbol{y}}_{new}=\boldsymbol{X}_{new}\hat{\boldsymbol{\beta}} \]

\[ \bar{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}}, \boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}{\sigma}_e^2) \] This result is used to obtain the confidence band of a regression. As on the previous slide, the unknown variance can be replaced by its estimate, and the distribution of a single estimated expected value becomes a Student's t distribution.

Prediction of future observations under the Gaussian GLM

\[ \hat{\boldsymbol{y}}_{new}=\boldsymbol{X}_{new}\hat{\boldsymbol{\beta}} \]

\[ \hat{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}}, (\boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}+\boldsymbol{I}){\sigma}_e^2) \] This result is used to obtain the prediction band of a regression.
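A short R illustration of both bands, continuing the sketch (fit as before; the new GNP values are arbitrary choices of mine):

    new_data <- data.frame(GNP = c(250, 300, 350))               # hypothetical new GNP values
    predict(fit, newdata = new_data, interval = "confidence")    # band for E(y_new)
    predict(fit, newdata = new_data, interval = "prediction")    # wider: adds residual variance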

Estimating the mean vs predicting future observations

\[ \bar{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}}, \boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}{\sigma}_e^2) \] versus \[ \hat{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}}, (\boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}+\boldsymbol{I}){\sigma}_e^2) \] Class challenge: make at least two observations about these two expressions

Example 2: Incidence matrix for classification factors

Example 2: alternative full-rank parametrizations

Example 2: OLS equations

Example 2: Hypothesis testing

Model selection: statistical approach

Outlier detection: statistical approach

Machine learning approach to Model selection


Machine learning approach to Model checking