Basic Linear model
\[
\color{blue}{y_i} = \sum_{j=1}^{p} {\color{red}{x_{ij}} \color{darkgreen}{\beta_j}}+\color{orange}{e_i}
\]
\[
E(\color{orange}{e_i}) = 0
\qquad(1)\]
\(\color{blue}{y_i}\): observation of the response variable, \(\color{red}{x_{ij}}\): observation of predictor variable \(j\), \(\color{darkgreen}{\beta_j}\): linear coefficient, \(\color{orange}{e_i}\): residual.
In matrix form: \[
\boldsymbol{\color{blue}y}=\boldsymbol{\color{red}X}\boldsymbol{\color{darkgreen}\beta}+\boldsymbol{\color{orange}e}
\qquad(2)\]
We assume \(rank(\boldsymbol{\color{red}X}) = p\), i.e., \(\boldsymbol{X}\) has full column rank. Class question.
Example 1: Longley data (first 6 observations shown)
GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
1947 83.0 234.289 235.6 159.0 107.608 1947 60.323
1948 88.5 259.426 232.5 145.6 108.632 1948 61.122
1949 88.2 258.054 368.2 161.6 109.773 1949 60.171
1950 89.5 284.599 335.1 165.0 110.929 1950 61.187
1951 96.2 328.975 209.9 309.9 112.075 1951 63.221
1952 98.1 346.999 193.2 359.4 113.270 1952 63.639
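These are the first rows of R's built-in longley data set; as a quick sketch, they can be displayed with:

```r
# longley ships with base R (datasets package); show the first six rows
head(longley)
```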
\[
\text{Employed}_i=\alpha+\text{GNP}_i \beta+ e_i,
\boldsymbol{y}=\begin{bmatrix}60.323 \\61.122 \\60.171 \\61.187 \\63.221 \\63.639 \\\end{bmatrix},
\boldsymbol{X} =\begin{bmatrix}1&234.289 \\1&259.426 \\1&258.054 \\1&284.599 \\1&328.975 \\1&346.999 \\\end{bmatrix}
\]
Model assumptions and choosing \(\beta\)
We assume \(E(\boldsymbol{e}) = \boldsymbol{0}\) and make no further distributional assumptions. Then: \(E(\boldsymbol{\color{blue}y})=\boldsymbol{\color{red}X}\boldsymbol{\color{darkgreen}\beta}\)
What is the best fitting line?
All of these lines produce residuals with expectation 0.
We need another criterion to estimate \(\beta\).
Ordinary Least Squares
Choose the parameters that minimize the sum of squared errors: \(SSE=\sum_{i=1}^{n} \hat{e}_i^2 = \hat{\boldsymbol{e}}^{'}\hat{\boldsymbol{e}}=(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})^{'}(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})\)
To do this, we take the derivative of SSE with respect to \(\hat{\boldsymbol{\beta}}\) and set it to zero (derivation not shown).
This is called the ordinary least squares (OLS) estimate of \(\boldsymbol{\beta}\), and it is computed as follows:
\[\hat{\boldsymbol{\beta}} =(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} \boldsymbol{X^{'}} \boldsymbol{y}
\] Thus, the OLS fitted values can be written as:
\[
\hat{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\beta}} = \boldsymbol{X}(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} \boldsymbol{X^{'}} \boldsymbol{y} = \boldsymbol{P_{X}}\boldsymbol{y}
\]
where \(\boldsymbol{P_X}\) is the projection matrix, which projects the response onto the column space of the predictors in \(\boldsymbol{X}\).
Example 1:
\[
\boldsymbol{X^{'}} \boldsymbol{X} = \begin{bmatrix} 16& 6203 \\ 6203&2553152 \\\end{bmatrix},
(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} = \begin{bmatrix} 1.1e+00&-2.6e-03 \\-2.6e-03& 6.7e-06 \\\end{bmatrix}
\] \[
\boldsymbol{X}^{'}\boldsymbol{y}=\begin{bmatrix} 1045 \\410323 \\\end{bmatrix},
\boldsymbol{\hat{\beta}}=\begin{bmatrix}51.844 \\ 0.035 \\\end{bmatrix}
\] Predicted y is \(\boldsymbol{\hat{y}}=\boldsymbol{X}\boldsymbol{\hat{\beta}}\) (first 10 values shown):
\[
\begin{bmatrix}60&60.9&60.8&61.7&63.3&63.9&64.5&64.5&65.7&66.4 \\\end{bmatrix}
\] Observed y:
\[
\begin{bmatrix}60.3&61.1&60.2&61.2&63.2&63.6&65&63.8&66&67.9 \\\end{bmatrix}
\]
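As a sketch, the quantities in Example 1 can be reproduced in R with explicit matrix algebra (using all 16 observations of the longley data):

```r
# OLS for Example 1 by explicit matrix algebra (full longley data, n = 16)
y <- longley$Employed
X <- cbind(1, longley$GNP)          # intercept column plus GNP

XtX      <- t(X) %*% X              # X'X
XtX_inv  <- solve(XtX)              # (X'X)^{-1}
Xty      <- t(X) %*% y              # X'y
beta_hat <- XtX_inv %*% Xty         # OLS estimate of beta

P_X   <- X %*% XtX_inv %*% t(X)     # projection ("hat") matrix
y_hat <- P_X %*% y                  # fitted values, equal to X %*% beta_hat

round(beta_hat, 3)                  # compare with the values above: 51.844 and 0.035
```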
Properties of the projection matrix
\(\boldsymbol{P_X}\) is an important matrix in mixed models. It is the projection matrix onto the column space of \(\boldsymbol{X}\), and we will encounter other projection matrices in this course. Its main properties, which can be verified numerically as sketched after the list below, are:
Symmetric: \(\boldsymbol{P_X} = \boldsymbol{P_X^{'}}\)
Idempotent: \(\boldsymbol{P_X} = \boldsymbol{P_X} \boldsymbol{P_X}\)
\(\boldsymbol{P_X} \boldsymbol{X} = \boldsymbol{X}\)
\(rank(\boldsymbol{P_X}) = rank(\boldsymbol{X}) = tr(\boldsymbol{P_X}) = p\)
\(\boldsymbol{\hat{e}}=( \boldsymbol{I}-\boldsymbol{P_X})\boldsymbol{y}\)
\(( \boldsymbol{I}-\boldsymbol{P_X})( \boldsymbol{I}-\boldsymbol{P_X})=( \boldsymbol{I}-\boldsymbol{P_X})\)
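A minimal sketch of a numeric check of these properties, using the Example 1 design matrix:

```r
# Numeric check of the projection-matrix properties for Example 1
X   <- cbind(1, longley$GNP)
P_X <- X %*% solve(t(X) %*% X) %*% t(X)

all.equal(P_X, t(P_X))                                # symmetric
all.equal(P_X, P_X %*% P_X)                           # idempotent
all.equal(P_X %*% X, X)                               # P_X X = X
sum(diag(P_X))                                        # trace = rank(X) = p = 2
e_hat <- (diag(nrow(X)) - P_X) %*% longley$Employed   # residuals as (I - P_X) y
```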
Unbiased estimator of the variance
An unbiased estimator of the error variance \(var(e_i)=\sigma^2_e\) is: \[
\hat{\sigma}_e^2=\frac{\boldsymbol{y}^{'}( \boldsymbol{I}-\boldsymbol{P_X} )\boldsymbol{y}} {n-p}
\] Also, note that: \[
SSE=\boldsymbol{y}^{'}( \boldsymbol{I}-\boldsymbol{P_X} )\boldsymbol{y}
\] The proof requires a bit of matrix algebra; you are welcome to try it and ask me questions.
Example 1:
\[
\hat{\sigma}_e^2= 0.43,
\hat{\sigma}_e= 0.66,
\boldsymbol{\hat{\beta}}=\begin{bmatrix}51.844 \\ 0.035 \\\end{bmatrix}
\]
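A sketch of how these numbers can be reproduced via the quadratic form, followed by the lm() call whose summary output is shown below:

```r
# Error-variance estimate via the quadratic form y'(I - P_X)y / (n - p)
y   <- longley$Employed
X   <- cbind(1, longley$GNP)
n   <- nrow(X); p <- ncol(X)
P_X <- X %*% solve(t(X) %*% X) %*% t(X)
SSE <- drop(t(y) %*% (diag(n) - P_X) %*% y)
SSE / (n - p)                  # sigma_e^2 hat, about 0.43
sqrt(SSE / (n - p))            # sigma_e hat, about 0.66

# The same fit with lm(); summary() gives output like that shown below
summary(lm(Employed ~ GNP, data = longley))
```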
Call:
lm(formula = Employed ~ GNP, data = longley)
Residuals:
Min 1Q Median 3Q Max
-0.7796 -0.5544 -0.0094 0.3436 1.4459
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51.84359 0.68137 76.1 < 2e-16 ***
GNP 0.03475 0.00171 20.4 8.4e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.66 on 14 degrees of freedom
Multiple R-squared: 0.967, Adjusted R-squared: 0.965
F-statistic: 415 on 1 and 14 DF, p-value: 8.36e-12
Variance of (linear combinations of) estimates of linear parameters
The variance of \(\hat{\boldsymbol{\beta}}\) is: \[
var(\hat{\boldsymbol{\beta}})=(\boldsymbol{X^{'}X})^{-1}{\sigma}_e^2\\
var(\boldsymbol{K}\hat{\boldsymbol{\beta}})=\boldsymbol{K}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{K}^{'}{\sigma}_e^2
\] where \(\boldsymbol{K}\) is a contrast matrix.
Challenge for the class: select \(\boldsymbol{K}\) so that \(\boldsymbol{K}\hat{\boldsymbol{\beta}}\) is just the \(i^{th}\) element of \(\hat{\boldsymbol{\beta}}\)
Another challenge: in practice, we don’t know \(\sigma_e^2\); what do we do then?
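As an illustration (not the answer to either challenge), a sketch with a hypothetical contrast \(\boldsymbol{K} = [1 \;\; 300]\), i.e., the expected value of Employed at GNP = 300, with \(\sigma_e^2\) replaced by its estimate:

```r
# Variance of K beta-hat for a hypothetical contrast K = (1, 300),
# i.e., the expected Employed at GNP = 300 (sigma_e^2 replaced by its estimate)
fit <- lm(Employed ~ GNP, data = longley)
X   <- model.matrix(fit)
s2  <- sum(residuals(fit)^2) / (nrow(X) - ncol(X))   # sigma_e^2 hat
K   <- matrix(c(1, 300), nrow = 1)

K %*% coef(fit)                          # K beta-hat
K %*% solve(t(X) %*% X) %*% t(K) * s2    # var(K beta-hat)
```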
Assuming normality
So far we have not made any parametric assumptions. But we can assume that the residuals are independent and identically distributed Gaussian random variables:
\[
\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{e}\\
\boldsymbol{e} \sim N(\boldsymbol{0},\boldsymbol{I}{\sigma}_e^2)
\] Under these assumptions, we can obtain the Maximum Likelihood Estimates of the parameters by maximizing the log-likelihood (the likelihood is the probability density function of the data, viewed as a function of the parameters).
Estimates of parameters under the normal linear model
\[
\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{e},
\boldsymbol{e} \sim N(\boldsymbol{0},\boldsymbol{I}{\sigma}_e^2),
\] \[
\hat{\boldsymbol{\beta}} =(\boldsymbol{X^{'}} \boldsymbol{X})^{-1} \boldsymbol{X^{'}} \boldsymbol{y},
\] \[
\hat{\boldsymbol{\beta}} \sim N({\boldsymbol{\beta}},(\boldsymbol{X^{'}X})^{-1}{\sigma}_e^2)
\] \[
\hat{\sigma}_e^2=\frac{\boldsymbol{y}^{'}( \boldsymbol{I}-\boldsymbol{P_X} )\boldsymbol{y}} {n-p}
\] These estimates, variances and expectations coincide with those from OLS for this model. (Strictly, the ML estimate of \({\sigma}_e^2\) divides the SSE by \(n\); the \(n-p\) form shown here is the usual unbiased estimator.)
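A small simulation sketch, with assumed "true" values of \(\boldsymbol{\beta}\) and \(\sigma_e\) taken from the Example 1 fit, illustrating that \(\hat{\boldsymbol{\beta}}\) has the stated normal sampling distribution:

```r
# Simulate from the normal linear model and compare the empirical covariance
# of beta-hat with (X'X)^{-1} sigma_e^2.
# The "true" beta and sigma_e are assumed values, taken from the Example 1 fit.
set.seed(1)
X       <- cbind(1, longley$GNP)
beta    <- c(51.84, 0.035)
sigma_e <- 0.66
XtX_inv <- solve(t(X) %*% X)

B <- replicate(5000, {
  y_sim <- X %*% beta + rnorm(nrow(X), sd = sigma_e)
  drop(XtX_inv %*% t(X) %*% y_sim)        # beta-hat for this simulated data set
})
round(cov(t(B)), 6)                       # empirical covariance of beta-hat
round(XtX_inv * sigma_e^2, 6)             # theoretical (X'X)^{-1} sigma_e^2
```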
Inference with unknown variances
\[
\hat{\boldsymbol{\beta}} \sim N({\boldsymbol{\beta}},(\boldsymbol{X^{'}X})^{-1}{\sigma}_e^2)
\] For a single coefficient: \[
\hat{\beta_j} \sim N(\beta_j,d_j \sigma_e^2),
\]
where \(d_j\) is the \(j^{th}\) diagonal element of \((\boldsymbol{X^{'}X})^{-1}\). With unknown residual variance: \[
\hat{\beta_j} \sim t({\beta_j},d_j\hat{\sigma}_e^2,n-p)
\]
This is used to obtain confidence intervals and hypothesis tests for the regression coefficients.
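A sketch of this t-based interval for the GNP slope in Example 1, compared against R's confint():

```r
# t-based 95% confidence interval for the GNP slope, by hand and via confint()
fit  <- lm(Employed ~ GNP, data = longley)
X    <- model.matrix(fit)
n    <- nrow(X); p <- ncol(X)
s2   <- sum(residuals(fit)^2) / (n - p)        # sigma_e^2 hat
d_j  <- solve(t(X) %*% X)[2, 2]                # d_j for the slope (j = 2)
se_j <- sqrt(d_j * s2)

coef(fit)[2] + c(-1, 1) * qt(0.975, df = n - p) * se_j
confint(fit, "GNP")                            # should agree
```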
Estimation of expected values under the Gaussian GLM
Given a new matrix \(\boldsymbol{X}_{new}\), the expected value is: \[
\widehat{E(\boldsymbol{y}_{new})}=\bar{\boldsymbol{y}}_{new}=\boldsymbol{X}_{new}\hat{\boldsymbol{\beta}}
\]
\[
\bar{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}},
\boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}{\sigma}_e^2)
\] This result is used to obtain the confidence band of a regression. As on the previous slide, when the true variance is replaced by its estimate, the distribution of a single estimated expected value becomes a Student's t distribution.
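In R, a sketch of this confidence band computation using predict(); the grid of new GNP values is hypothetical:

```r
# Confidence band for the expected value of Employed over a grid of new GNP values
fit       <- lm(Employed ~ GNP, data = longley)
new_gnp   <- data.frame(GNP = seq(230, 560, by = 10))   # hypothetical grid of GNP values
conf_band <- predict(fit, newdata = new_gnp, interval = "confidence", level = 0.95)
head(conf_band)    # columns: fit, lwr, upr
```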
Prediction of future observations under the Gaussian GLM
\[
\hat{\boldsymbol{y}}_{new}=\boldsymbol{X}_{new}\hat{\boldsymbol{\beta}}
\]
\[
\hat{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}},
(\boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}+\boldsymbol{I}){\sigma}_e^2)
\] This result is used to obtain the prediction band of a regression.
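And the corresponding prediction band, again as a sketch with predict() over the same hypothetical grid:

```r
# Prediction band for future observations of Employed; note interval = "prediction"
fit       <- lm(Employed ~ GNP, data = longley)
new_gnp   <- data.frame(GNP = seq(230, 560, by = 10))   # same hypothetical grid
pred_band <- predict(fit, newdata = new_gnp, interval = "prediction", level = 0.95)
head(pred_band)    # wider than the confidence band because of the extra I * sigma_e^2 term
```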
Estimating the mean vs predicting future observations
\[
\bar{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}},
\boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}{\sigma}_e^2)
\] versus \[
\hat{\boldsymbol{y}}_{new} \sim N(\boldsymbol{X}_{new}{\boldsymbol{\beta}},
(\boldsymbol{X}_{new}(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X}_{new}^{'}+\boldsymbol{I}){\sigma}_e^2)
\] Class challenge: make at least two observations about these two expressions
Example 2: Incidence matrix for classification factors
Example 2: alternative full-rank parametrizations
Example 2: Hypothesis testing
Model selection: statistical approach
Outlier detection: statistical approach
Machine learning approach to Model selection
Machine learning approach to Model checking