1. Simple Linear Regression (SLR) Model

Let \((x_1, Y_1), (x_2, Y_2), ..., (x_n, Y_n)\) be \(n\) bivariate data points observed from a simple linear model \[Y_i=\beta_0+\beta_1 x_i+\varepsilon_i, i=1, 2, ..., n\]

1.2 Normality Assumption

\[\varepsilon_i \sim Normal(0, \sigma^2)\]

Errors are independent and identically distributed (iid):

  • Normally distributed
  • Mean zero
  • Constant spread (\(\sigma^2\)), homoscedastic
  • Independent (or at least uncorrelated)

1.3 Parameters and Estimators

Least squares estimators, \(\hat{\beta}_0\) and \(\hat{\beta}_1\), are found by minimizing \(SS(Res)=\sum_{i=1}^n (Y_i-\hat{Y}_i)^2\).

  • Y-intercept: parameter \(\beta_0\), estimator \(\hat{\beta}_0=\bar{Y}-\hat{\beta}_1 \bar{x}\)
  • Slope: parameter \(\beta_1\), estimator \(\hat{\beta}_1 =\frac{n\sum_{i=1}^n x_iY_i-\sum_{i=1}^n x_i \sum_{i=1}^n Y_i}{n\sum_{i=1}^n x_i^2-(\sum_{i=1}^n x_i)^2}=\frac{\sum_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y})}{\sum_{i=1}^n (x_i-\bar{x})^2}=\frac{\sum_{i=1}^n (x_i-\bar{x})Y_i}{\sum_{i=1}^n (x_i-\bar{x})^2} = r\times \frac{s_Y}{s_x}\)
  • Variance: parameter \(\sigma^2\), estimator \(s^2=MS(Res)=SS(Res)/(n-2)\)

where

  • Sample mean of \(x\): \(\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\)
  • Sample mean of \(Y\): \(\bar{Y}=\frac{1}{n}\sum_{i=1}^n Y_i\)
  • Sample standard deviation of \(x\): \(s_x=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}\)
  • Sample standard deviation of \(Y\): \(s_Y=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (Y_i-\bar{Y})^2}\)
  • Correlation coefficient: \(r=\frac{1}{n-1}\sum_{i=1}^n(\frac{(x_i-\bar{x})}{s_x})(\frac{(Y_i-\bar{Y})}{s_Y})\)
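As a quick check, here is a minimal R sketch (simulated data, so all names and true coefficient values are illustrative) that computes \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(s^2\) from the formulas above and compares them to lm():

```r
# Closed-form SLR estimates vs. lm() (simulated, illustrative data)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
Y <- 2 + 0.5 * x + rnorm(n)                  # true beta0 = 2, beta1 = 0.5

b1_hat <- sum((x - mean(x)) * (Y - mean(Y))) / sum((x - mean(x))^2)
b0_hat <- mean(Y) - b1_hat * mean(x)
s2     <- sum((Y - (b0_hat + b1_hat * x))^2) / (n - 2)   # MS(Res)

c(b0_hat, b1_hat);  coef(lm(Y ~ x))          # manual estimates vs. lm()
s2;  summary(lm(Y ~ x))$sigma^2              # manual s^2 vs. lm()
```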

1.3.1 Properties of Estimators

Slope:

\[\hat{\beta}_1 \sim Normal(\beta_1, \frac{1}{\sum_{i=1}^n (x_i-\bar{x})^2} \sigma^2)\]

Intercept:

\[\hat{\beta}_0 \sim Normal(\beta_0, (\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n (x_i-\bar{x})^2}) \sigma^2)\]
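The sampling distribution of \(\hat{\beta}_1\) can be checked by simulation; a rough R sketch with a fixed, simulated design (true values are illustrative):

```r
# Monte Carlo check of Var(beta1-hat) = sigma^2 / Sxx (illustrative true values)
set.seed(8)
n   <- 25
x   <- runif(n, 0, 10)                       # fixed design
Sxx <- sum((x - mean(x))^2)

b1_sims <- replicate(5000, {
  Y <- 2 + 0.5 * x + rnorm(n, sd = 2)        # beta1 = 0.5, sigma = 2
  coef(lm(Y ~ x))[2]
})

c(mean(b1_sims), var(b1_sims))               # ~ 0.5 and ~ sigma^2 / Sxx
4 / Sxx                                      # theoretical variance (sigma^2 = 4)
```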

1.3.2 Properties of Residuals

The \(i^{th}\) residual is given by \[e_i=Y_i-\hat{Y}_i\]

Key Terms
  • Residual sum of squares: \(SS(Res)=\sum_{i=1}^n e_i^2=\sum_{i=1}^n (Y_i -\hat{Y}_i)^2\)
  • Residual mean squared error: \(MS(Res)=\frac{SS(Res)}{n-2}\)
Residual restrictions:
  • \(\sum_{i=1}^n e_i = \sum_{i=1}^n (Y_i-\hat{Y}_i)=0\)
  • \(\sum_{i=1}^n x_ie_i=\sum_{i=1}^n x_i(Y_i-\hat{Y}_i)=0\)
Properties:
  • \(E[SS(Res)]=(n-2)\sigma^2\)
    • \(E[MS(Res)]=\sigma^2\), so \(MS(Res)\) is an unbiased estimator of \(\sigma^2\)
  • \(SS(Res)/\sigma^2\) has a \(\chi^2\) distribution with \(n-2\) degrees of freedom
  • \(SS(Res)\) is independent of both the least squares estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
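A short R sketch (simulated, illustrative data) showing the residual restrictions and the \(MS(Res)\) estimate numerically:

```r
# Residual restrictions and MS(Res) (simulated, illustrative data)
set.seed(2)
x <- runif(40)
Y <- 1 + 3 * x + rnorm(40)
fit <- lm(Y ~ x)
e <- resid(fit)

sum(e)                        # ~ 0 (first restriction)
sum(x * e)                    # ~ 0 (second restriction)
sum(e^2) / fit$df.residual    # MS(Res); equals summary(fit)$sigma^2
```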

1.4 Important Distributions for Regression

1.4.1 Normal: \(X\sim Normal(\mu, \sigma^2)\)

Parameters:

  • \(\mu\): Mean
  • \(\sigma^2\) : Variance

Properties:

  • Expected value, \(E[X]=\mu\)
  • Variance, \(V[X]=\sigma^2\)
  • Centered around the mean
  • Symmetric
Standard Normal: \(Z\sim Normal(\mu=0, \sigma^2=1)\)

1.4.2 Chi-Squared: \(V\sim \chi^2(df=\nu)\)

Parameters:

  • \(\nu\): Degrees of freedom

Properties:

  • Expected value, \(E[X]=\nu\)
  • Variance, \(V[X]=2 \nu\)
  • The square of a standard normal random variable is chi-squared with one degree of freedom: \(Z^2 \sim \chi^2(df=1)\)
  • The sum of independent \(\chi^2\)’s is a \(\chi^2\) with the degrees of freedom summed

1.4.3 Student’s T: \(T\sim t_{df=\nu}\)

Parameters:

  • \(\nu\): Degrees of freedom

Properties:

  • Expected value, \(E[X]=0\)
  • Variance: Don’t worry about it
  • Converges to the standard normal distribution as \(\nu \rightarrow \infty\)
  • Ratio of a standard normal to the square root of an independent chi-squared divided by its degrees of freedom, \(T=\frac{Z}{\sqrt{V/\nu}}\), with \(Z\) and \(V\) independent

1.4.4 F-distribution: \(F\sim F(\nu_1, \nu_2)\)

Parameters:

  • \(\nu_1\): Numerator degrees of freedom
  • \(\nu_2\): Denominator degrees of freedom

Properties:

  • The ratio of two independent chi-squared rvs, each divided by its degrees of freedom: \(F=\frac{U/\nu_1}{V/\nu_2}\), where \(U\sim \chi^2(\nu_1)\) and \(V\sim \chi^2(\nu_2)\)
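These relationships can be sanity-checked with a quick Monte Carlo sketch in R (the degrees of freedom chosen here are purely illustrative):

```r
# Monte Carlo check of the chi-squared / t / F relationships
set.seed(3)
B <- 1e5
Z <- rnorm(B)                 # standard normal
U <- rchisq(B, df = 3)        # chi-squared draws, independent of Z and V
V <- rchisq(B, df = 10)

mean(Z^2 <= qchisq(0.9, df = 1))              # ~ 0.9: Z^2 is chi-squared(1)
mean(Z / sqrt(V / 10) <= qt(0.9, df = 10))    # ~ 0.9: t with 10 df
mean((U / 3) / (V / 10) <= qf(0.9, 3, 10))    # ~ 0.9: F(3, 10)
```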

1.5 Hypothesis Tests

1.5.1 Test for Slope

Hypotheses:
  • Two-sided (nondirectional): \(H_0:\beta_1=\beta_{1,0}\) vs \(H_A:\beta_1\neq \beta_{1,0}\)
  • Right-sided/One-sided Upper: \(H_0:\beta_1=\beta_{1,0}\) vs \(H_A:\beta_1> \beta_{1,0}\)
  • Left-sided/One-sided Lower: \(H_0:\beta_1=\beta_{1,0}\) vs \(H_A:\beta_1< \beta_{1,0}\)
Reference distribution:

Student t-distribution with \(n-2\) degrees of freedom

Test Statistic

\[t=\frac{(\hat{\beta}_1-\beta_{1,0})\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}}{\sqrt{SS(Res)/(n-2)}}=\frac{(\hat{\beta}_1-\beta_{1,0})\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}}{\sqrt{MS(Res)}}\]

We can simplify this to say: \[t=\frac{\hat{\beta}_1-\beta_{1,0}}{SE(\hat{\beta}_1)}\]

Where \(SE(\hat{\beta}_1)\) is the standard error for the slope estimator and is given by \[SE(\hat{\beta}_1)=\frac{\sqrt{SS(Res)/(n-2)}}{\sqrt{\sum_{i=1}^n (x_i -\bar{x})^2}}=\frac{\sqrt{MS(Res)}}{\sqrt{\sum_{i=1}^n (x_i -\bar{x})^2}}=\frac{s}{\sqrt{\sum_{i=1}^n (x_i -\bar{x})^2}}\]

Typically, we use \(\beta_{1,0}=0\) to test for the significance of a relationship between our response (\(Y\)) and explanatory (\(x\)) variables.

Thus, the test statistic is:

\[t=\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\]

P-values
  • Two-sided (nondirectional): \(2\times Pr(t_{n-2} \geq |\text{test stat}|)\)
  • Right-sided/One-sided Upper: \(Pr(t_{n-2} \geq \text{test stat})\)
  • Left-sided/One-sided Lower: \(Pr(t_{n-2} \leq \text{test stat})\)
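A minimal R sketch (simulated, illustrative data) computing the slope t-test by hand and comparing it to the row reported by summary(lm()):

```r
# Slope t-test by hand vs. summary(lm()) (simulated, illustrative data)
set.seed(4)
n <- 30
x <- rnorm(n)
Y <- 1 + 0.8 * x + rnorm(n)
fit <- lm(Y ~ x)

MS_res <- sum(resid(fit)^2) / (n - 2)
se_b1  <- sqrt(MS_res / sum((x - mean(x))^2))
t_stat <- coef(fit)["x"] / se_b1                              # tests beta_{1,0} = 0
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE) # two-sided p-value

c(t_stat, p_val)
summary(fit)$coefficients["x", ]   # same t value and Pr(>|t|)
```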

1.6 Confidence Intervals

1.6.1 Confidence Interval for Slope, \(\beta_1\)

\[\hat{\beta}_1 \pm t_{df=n-2, \alpha/2}^* \times SE(\hat{\beta}_1)\]

where \(SE(\hat{\beta}_1)=\frac{\sqrt{MS(Res)}}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}}\)

1.6.2 Confidence Interval for Intercept, \(\beta_0\)

\[\hat{\beta}_0 \pm t_{df=n-2, \alpha/2}^* \times SE(\hat{\beta}_0)\]

where \(SE(\hat{\beta}_0)=\sqrt{MS(Res)\times (\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n (x_i-\bar{x})^2})}\)
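In R, these intervals can be obtained with confint(); a small sketch with simulated, illustrative data:

```r
# t-based confidence intervals for the intercept and slope (simulated data)
set.seed(5)
x <- rnorm(30)
Y <- 1 + 0.8 * x + rnorm(30)
fit <- lm(Y ~ x)

confint(fit, level = 0.95)       # rows: (Intercept) and x

# Slope interval by hand: estimate +/- t* x SE
coef(fit)["x"] + c(-1, 1) * qt(0.975, df = fit$df.residual) *
  summary(fit)$coefficients["x", "Std. Error"]
```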

1.7 Prediction

Once the model is fitted, we can use it to predict the value of the response variable \(Y\) at any given value \(x=x_0\).

1.7.1 Point Estimate

We simply input \(x=x_0\) into the fitted equation \[\hat{Y}_0=\hat{\beta}_0+\hat{\beta}_1 x_0\]

1.7.2 Confidence Interval for the Mean Response

\[\mu_{Y|x_0}=E[Y|x_0]=\beta_0+\beta_1 x_0\]

The form of a \(100(1-\alpha)\%\) confidence interval for the mean response is given by

\[(\hat{\beta}_0+\hat{\beta}_1 x_0)\pm t_{df=n-2, \alpha/2}^* \times \sqrt{MS(Res)\times (\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2})}\]

1.7.3 Prediction Interval for a New Response

A confidence interval for the mean response does not reflect the bounds within which we realistically expect to observe a single new observation at \(x=x_0\). A single observation has more variability than the average of observations. Therefore, we account for that by changing the error term.

The form of a \(100(1-\alpha)\%\) prediction interval for a new response at \(x=x_0\) is given by

\[(\hat{\beta}_0+\hat{\beta}_1 x_0)\pm t_{df=n-2, \alpha/2}^* \times \sqrt{MS(Res)\times (1+\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2})}\]
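In R, both intervals come from predict() on the fitted model; a small sketch with simulated, illustrative data:

```r
# Confidence vs. prediction interval at x0 (simulated, illustrative data)
set.seed(6)
x <- runif(40, 0, 10)
Y <- 2 + 0.5 * x + rnorm(40)
fit <- lm(Y ~ x)
new <- data.frame(x = 5)                                            # x0 = 5

predict(fit, newdata = new, interval = "confidence", level = 0.95)  # mean response
predict(fit, newdata = new, interval = "prediction", level = 0.95)  # new response (wider)
```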

2. Multiple Linear Regression (MLR) Model

Let \(Y\) be a response variable which can possibly be explained by predictors \(x_1, x_2, ..., x_p\). We say there is a linear model that describes a potential relationship between \(Y\) and \(x_1, x_2, ..., x_p\) if this relationship can be expressed as:

\[Y_i=\beta_0+\beta_1 x_{i, 1}+\beta_2 x_{i, 2}+ ... +\beta_p x_{i, p}+ \varepsilon_i\]

Where \(\beta_0, \beta_1, ..., \beta_p\) are unknown regression coefficients and the errors \(\varepsilon_i\) satisfy the same iid \(Normal(0, \sigma^2)\) assumptions as in SLR, for \(i=1, 2, ..., n\).

2.1 Expressing the model with matrices

2.1.1 Notation

Response Vector

\(\textbf{Y}_{n\times 1}=\begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{bmatrix}\)

Design Matrix

\(\textbf{X}_{n\times (p+1)}=\begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p}\\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p}\\ \end{bmatrix}\)

Coefficient (Parameter) Vector

\(\boldsymbol \beta_{(p+1)\times 1}=\begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{bmatrix}\)

Error Vector

\(\boldsymbol \varepsilon_{n\times 1}=\begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix}\)

Thus we can express the model using matrices as

\[\textbf{Y}_{n\times 1}=\textbf{X}_{n\times (p+1)}\boldsymbol \beta_{(p+1)\times 1}+\boldsymbol \varepsilon_{n\times 1}\]

2.1.2 Least Squares Estimator

The least squares estimator of \(\beta\) is given by

\[\hat{\boldsymbol \beta}=(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{Y}\]

Thus the least squares regression equation is given by:

\[\hat{\textbf{Y}}=\textbf{X}\hat{\boldsymbol \beta}=\textbf{X}(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{Y}\]
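A short R sketch with simulated predictors (names and true coefficients are illustrative) computing \(\hat{\boldsymbol \beta}\) from the matrix formula and comparing it to coef(lm()):

```r
# Matrix least squares vs. lm() (simulated, illustrative data)
set.seed(7)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
Y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                          # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y   # (X'X)^{-1} X'Y
cbind(beta_hat, coef(lm(Y ~ x1 + x2)))         # the two columns should agree
```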

2.2 The Hat (Projection) Matrix

The \(n\times n\) matrix, known as the hat matrix, is given by

\[\textbf{H}=\textbf{X}(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\]

2.2.1 Properties

The least squares predicted values \(\hat{\textbf{Y}}=\textbf{H}\textbf{Y}\) can be considered as the image of \(\textbf{Y}\) under the projection \(\textbf{H}\)

The hat matrix is:

  • Idempotent (\(\textbf{H}\textbf{H}=\textbf{H}^2=\textbf{H}\))
  • Symmetric (\(\textbf{H}=\textbf{H}^T\))

The deviation between \({\textbf{Y}}\) and \(\hat{\textbf{Y}}\) is the vector of residuals.

\[{\textbf{e}}=\textbf{Y}-\hat{\textbf{Y}}=\textbf{Y}-\textbf{H}\textbf{Y}=(\textbf{I}-\textbf{H})\textbf{Y}\]

  • It can be shown that \((\textbf{I}-\textbf{H})^2=\textbf{I}-\textbf{H}\) and \((\textbf{I}-\textbf{H})\textbf{H}=0\). Therefore, the matrix \((\textbf{I}-\textbf{H})\) is also idempotent and is a projection orthogonal to the hat matrix \(\textbf{H}\).
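Continuing the simulated \(\textbf{X}\) and \(\textbf{Y}\) from the previous sketch, a quick numerical check of these properties:

```r
# Hat-matrix checks (X, Y, n from the previous sketch)
H <- X %*% solve(t(X) %*% X) %*% t(X)

max(abs(H %*% H - H))            # ~ 0: idempotent
max(abs(H - t(H)))               # ~ 0: symmetric
max(abs((diag(n) - H) %*% H))    # ~ 0: (I - H) is orthogonal to H
max(abs(drop(H %*% Y) - fitted(lm(Y ~ x1 + x2))))   # ~ 0: H Y gives the fitted values
```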

2.3 Foundations of MLR Inference

2.3.1 Multivariate Normal

A random vector \(\textbf{Y}^T=[Y_1, ..., Y_n]\) is said to have a multivariate normal distribution with mean parameter vector \(\boldsymbol \mu^T=[\mu_1, ..., \mu_n]\) and variance-covariance matrix \(\boldsymbol \Sigma\), if it has the joint probability density function

\[f(Y_1, ..., Y_n)=f(\textbf{Y})=\left(\frac{1}{\sqrt{2 \pi}}\right)^n \frac{1}{\det(\boldsymbol \Sigma)^{1/2}} e^{-\frac{1}{2}(\textbf{Y}-\boldsymbol \mu)^T \boldsymbol \Sigma^{-1}(\textbf{Y}-\boldsymbol \mu)}\]

Notation: \[\textbf{Y}\sim MVN(\boldsymbol \mu, \boldsymbol \Sigma)\]

\(MVN\): Multivariate Normal

Variance-Covariance Matrix, \(\boldsymbol \Sigma\)

The \((i,j)^{th}\) entry of the variance-covariance matrix \(\boldsymbol \Sigma\) is

\[\sigma_{i,j}=Cov(Y_i, Y_j)=\rho_{i,j}\sigma_i \sigma_j\]

where \(\rho_{i,j}\) is the correlation coefficient between \(Y_i\) and \(Y_j\)

\(\boldsymbol \Sigma=\begin{bmatrix} \sigma_1^2 & \sigma_{1,2} & \sigma_{1,3} & \cdots & \sigma_{1,n}\\ \sigma_{2,1} & \sigma_2^2 & \sigma_{2,3} & \cdots & \sigma_{2,n}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \sigma_{n,1} & \sigma_{n,2} & \sigma_{n,3} & \cdots & \sigma_{n}^2\\ \end{bmatrix}\)

Note that:

  • \(Var(Y_i)=Cov(Y_i, Y_i)=\sigma_{i,i}=\sigma_i^2\)
  • \(Cov(Y_i, Y_j)=\sigma_{i,j}=\sigma_{j,i}=Cov(Y_j, Y_i)\) (Symmetric)
Properties

Let random vector \(\textbf{Y}^T=[Y_1, ..., Y_n]\) have a multivariate normal distribution with mean vector \(\boldsymbol \mu\) and variance-covariance matrix \(\boldsymbol \Sigma\). Let \(\textbf{A}\) be a \(p\times n\) matrix.

Then

  • Expectation: \(E[\textbf{AY}]=\textbf{A} \boldsymbol \mu\)

  • Variance: \(Cov[\textbf{AY}]=\textbf{A} Cov(\textbf{Y})\textbf{A}^T=\textbf{A} \boldsymbol \Sigma \textbf{A}^T\)

Thus the distribution of \(\textbf{AY}\) is

\[\textbf{AY}\sim MVN(\textbf{A} \boldsymbol \mu, \textbf{A} \boldsymbol \Sigma \textbf{A}^T)\]
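A rough simulation check of these two properties. This sketch assumes the MASS package (for mvrnorm); the mean vector, covariance matrix, and \(\textbf{A}\) used here are arbitrary illustrative choices:

```r
# Empirical check of E[AY] = A mu and Cov[AY] = A Sigma A' (assumes the MASS package)
library(MASS)
set.seed(9)
mu    <- c(1, 2, 3)
Sigma <- matrix(c(2.0, 0.5, 0.3,
                  0.5, 1.0, 0.2,
                  0.3, 0.2, 1.5), nrow = 3)
A <- matrix(c(1, 1,  0,
              0, 1, -1), nrow = 2, byrow = TRUE)

Ysim <- mvrnorm(1e5, mu, Sigma)   # each row is one draw of Y
AY   <- Ysim %*% t(A)             # each row is A %*% Y for that draw

colMeans(AY);  drop(A %*% mu)             # empirical vs. theoretical mean
cov(AY);       A %*% Sigma %*% t(A)       # empirical vs. theoretical covariance
```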

2.3.2 Individual Tests

Since the least squares estimator satisfies

\[\hat{\boldsymbol \beta} \sim MVN(\boldsymbol \beta, \sigma^2(\textbf{X}^T\textbf{X})^{-1})\]

the distribution of an individual \(\hat{\beta}_j\) is given by

\[\hat{\beta}_j \sim N(\beta_j, \sigma^2 C_{j,j})\]

where \(C_{j,j}\) is the \(j^{th}\) diagonal entry of the \((\textbf{X}^T\textbf{X})^{-1}\) matrix.

As before we estimate \(\sigma^2\) with \(MS(Res)\), which under MLR is given by \(\hat{\sigma}^2=s^2=MS(Res)=SS(Res)/(n-p-1)\).

Thus, the standard error of \(\hat{\beta}_j\) is \[se(\hat{\beta}_j)=\hat{\sigma}\sqrt{C_{j,j}}\]

Then the test for an individual slope is

\[H_0: \beta_j=0\]

\[H_A: \beta_j \neq 0\]

The test statistic is

\[t_j=\frac{\hat{\beta}_j}{se(\hat{\beta}_j)}=\frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{C_{j,j}}}\]

The confidence interval is

\[\hat{\beta}_j \pm t^*_{df=n-p-1, \alpha/2} \times se(\hat{\beta}_j)\]
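A minimal R sketch with simulated data (all names and true coefficient values are illustrative) showing where these quantities appear in lm() output:

```r
# Individual coefficient tests and intervals in an MLR fit (simulated, illustrative data)
set.seed(10)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
Y  <- 1 + 2 * x1 + 0 * x2 - 1 * x3 + rnorm(n)
fit <- lm(Y ~ x1 + x2 + x3)

summary(fit)$coefficients        # Estimate, Std. Error, t value, Pr(>|t|)
confint(fit)                     # t-based intervals with df = n - p - 1

# se(beta_j) by hand from C = (X'X)^{-1}
X <- model.matrix(fit)
C <- solve(t(X) %*% X)
sqrt(summary(fit)$sigma^2 * diag(C))   # matches the Std. Error column
```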

Multiple Comparisons

So far we have considered individual hypothesis tests and confidence intervals for regression coefficients; however, performing many tests, each at level \(\alpha\), inflates the overall (family-wise) Type I error rate.

If we want to construct “simultaneous” hypothesis tests and/or confidence intervals for \(\beta_1, \beta_2, ..., \beta_p\) we have to consider adjustments for multiple comparisons.

Bonferroni Correction

Adjust the per-test Type I error rate to \(\alpha/k\), where \(k\) is the number of simultaneous tests or intervals.
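A small R sketch of the Bonferroni adjustment, continuing the simulated MLR fit above (taking \(k\) to be the number of slopes tested is an illustrative choice):

```r
# Bonferroni adjustment for k simultaneous slope tests / intervals (fit from above)
p_vals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]   # drop the intercept row
p.adjust(p_vals, method = "bonferroni")               # adjusted p-values, compare to alpha

k <- length(p_vals)
confint(fit, level = 1 - 0.05 / k)                    # simultaneous (Bonferroni) 95% intervals
```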

3. ANOVA

The analysis of variance (ANOVA) is based on the following decomposition of the sources of variation:

\[\sum_{i=1}^n(Y_i-\bar{Y})^2=\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2+\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\]

Where we have the following sums of squares:

  • Total: \(SS(Tot)=\sum_{i=1}^n(Y_i-\bar{Y})^2\)
  • Regression: \(SS(Reg)=\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2\)
  • Residual: \(SS(Res)=\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\)

so that \(SS(Tot)=SS(Reg)+SS(Res)\).

3.0.1 Quadratic Forms

Additional notation to define:

  • \(\textbf{1}_{n\times 1}\): Column vector of ones
  • \(\textbf{J}_{n\times n}=\textbf{1}\textbf{1}^T\): Matrix of ones
  • \(\textbf{I}_{n \times n}\): Identity matrix

Sum of Squares

  • \(SS(Total)=\textbf{Y}^T(\textbf{I}-\frac{1}{n}\textbf{J})\textbf{Y}\)
  • \(SS(Reg)=\textbf{Y}^T(\textbf{H}-\frac{1}{n}\textbf{J})\textbf{Y}\)
  • \(SS(Res)=\textbf{Y}^T(\textbf{I}-\textbf{H})\textbf{Y}\)
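Continuing the simulated MLR fit from the earlier sketch, these quadratic forms can be computed directly and checked against anova():

```r
# Quadratic-form sums of squares (continuing fit, Y, and n from the MLR sketch above)
X   <- model.matrix(fit)
H   <- X %*% solve(t(X) %*% X) %*% t(X)
J_n <- matrix(1, n, n)                  # J = 1 1'
I_n <- diag(n)

SS_tot <- drop(t(Y) %*% (I_n - J_n / n) %*% Y)
SS_reg <- drop(t(Y) %*% (H   - J_n / n) %*% Y)
SS_res <- drop(t(Y) %*% (I_n - H)       %*% Y)

c(SS_reg + SS_res, SS_tot)        # decomposition: SS(Reg) + SS(Res) = SS(Tot)
sum(anova(fit)[, "Sum Sq"])       # sequential term SS + residual SS, also equals SS(Tot)
```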

3.1 ANOVA Table

  • Regression: DF \(p\); Sum of Squares \(SS(Reg)\); Mean Square \(MS(Reg)=SS(Reg)/p\); F-value \(MS(Reg)/MS(Res)\); P-value pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)
  • Residual: DF \(n-p-1\); Sum of Squares \(SS(Res)\); Mean Square \(MS(Res)=SS(Res)/(n-p-1)\)
  • Total: DF \(n-1\); Sum of Squares \(SS(Tot)\)

3.2 R-squared

The proportion of variability described by the model:

\[R^2=\frac{SS(Reg)}{SS(Tot)}=1-\frac{SS(Res)}{SS(Tot)}\]
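Continuing the previous sketch, \(R^2\) can be recovered from either form and matches summary(lm()):

```r
# R-squared two ways (continuing SS_reg, SS_res, SS_tot, and fit from above)
SS_reg / SS_tot
1 - SS_res / SS_tot
summary(fit)$r.squared   # same value
```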

3.3 Distributions of Sum of Squares

  • \(\frac{SS(Res)}{\sigma^2} \sim \chi^2 (df=n-p-1)\)
  • \(\frac{SS(Reg)}{\sigma^2} \sim \chi^2 (df=p)\) under \(H_0: \beta_1=\beta_2=...=\beta_p=0\)
  • \(\frac{SS(Tot)}{\sigma^2} \sim \chi^2 (df=n-1)\) under the same null hypothesis

3.4 F-tests

3.4.1 Global F-test

Hypotheses:

\[H_0: \beta_1=\beta_2=...=\beta_p=0\]

\[H_A: \text{at least one } \beta_j\neq 0\]

Test Statistic:

\[F=\frac{\frac{SS(Reg)}{p}}{\frac{SS(Res)}{n-p-1}}=\frac{MS(Reg)}{MS(Res)}\]

Reference Distribution: \(F_{df1=p, df2=n-p-1}\)
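Continuing the quantities from the previous sketches, the global F statistic can be computed by hand and compared against the one reported by summary(lm()):

```r
# Global F-test by hand (continuing SS_reg, SS_res, and fit from the sketches above)
p      <- length(coef(fit)) - 1        # number of slopes
MS_reg <- SS_reg / p
MS_res <- SS_res / fit$df.residual     # df = n - p - 1
F_val  <- MS_reg / MS_res

pf(F_val, df1 = p, df2 = fit$df.residual, lower.tail = FALSE)   # p-value
summary(fit)$fstatistic                                         # same F, df1, df2
```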

3.4.2 General Linear Hypothesis

Hypotheses

\[H_0: \textbf{k}^T \boldsymbol \beta = \textbf{m}\]

\[H_A: \textbf{k}^T \boldsymbol \beta \neq \textbf{m}\]

Where

  • \(\textbf{k}^T\) is an \(r \times (p+1)\) matrix of rank \(r\) (each row specifies a linear combination of the coefficients)
  • \(\textbf{m}\) is an \(r \times 1\) vector of constants

The sampling distribution of \(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m}\) is

\[(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m}) \sim MVN(\textbf{k}^T \boldsymbol \beta - \textbf{m}, \sigma^2 \textbf{k}^T(\textbf{X}^T\textbf{X})^{-1}\textbf{k})\]

and under the null hypothesis the mean vector is \(\textbf{0}\).

Then, under the null, we have the following distribution

\[(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})^T[\sigma^2 \textbf{k}^T(\textbf{X}^T\textbf{X})^{-1}\textbf{k}]^{-1}(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})\sim \chi^2_{df=r}\]

which yields the following quadratic form

\[\text{Q}=(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})^T[\textbf{k}^T(\textbf{X}^T\textbf{X})^{-1}\textbf{k}]^{-1}(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})\]

The F-test statistic is then given by

\[F=\frac{\textbf{Q}/r}{s^2}\sim F_{df1=r, df2=n-p-1}\]

The general linear hypothesis can be used to compare full and reduced models, which is useful for variable selection.
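In R, this full-versus-reduced comparison can be run with anova() on two nested lm fits, which carries out the corresponding partial F-test. A sketch continuing the simulated x1, x2, x3, and Y from the earlier MLR sketch:

```r
# Partial F-test of H0: beta_2 = beta_3 = 0 (nested models; simulated data from above)
reduced <- lm(Y ~ x1)
full    <- lm(Y ~ x1 + x2 + x3)
anova(reduced, full)   # F statistic with df1 = 2, df2 = n - p - 1
```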

3.4.3 Variable Selection

In this class we discussed:

  • Forward selection (starts with an empty/null model)
  • Backward elimination (starts with a saturated model)
  • Best subset selection
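As one common implementation of these ideas, R's step() performs AIC-based forward or backward selection, and best subsets is available in add-on packages. A sketch using the simulated variables from above (all names are illustrative):

```r
# AIC-based stepwise selection with step() (one common implementation; simulated data from above)
null_mod <- lm(Y ~ 1)
full_mod <- lm(Y ~ x1 + x2 + x3)

step(null_mod, scope = ~ x1 + x2 + x3, direction = "forward")   # forward selection
step(full_mod, direction = "backward")                          # backward elimination
# Best subsets: e.g. regsubsets() from the leaps package, if installed
```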