1. Simple Linear Regression (SLR) Model

Let \((x_1, Y_1), (x_2, Y_2), ..., (x_n, Y_n)\) be \(n\) bivariate data points observed from a simple linear model \[Y_i=\beta_0+\beta_1 x_i+\varepsilon_i, i=1, 2, ..., n\]

1.2 Normality Assumption

\[\varepsilon_i \sim Normal(0, \sigma^2)\]

Errors are independent and identically distributed (iid):

  • Normally distributed
  • Mean zero
  • Constant spread (\(\sigma^2\)), homoscedastic
  • Independent (or at least uncorrelated)

1.3 Parameters and Estimators

Least squares estimators, \(\hat{\beta}_0\) and \(\hat{\beta}_1\), are found by minimizing \(SS(Res)=\sum_{i=1}^n (Y_i-\hat{Y}_i)^2\).

  • Y-intercept: parameter \(\beta_0\), estimator \(\hat{\beta}_0=\bar{Y}-\hat{\beta}_1 \bar{x}\)
  • Slope: parameter \(\beta_1\), estimator \(\hat{\beta}_1 =\frac{n\sum_{i=1}^n x_iY_i-\sum_{i=1}^n x_i \sum_{i=1}^n Y_i}{n\sum_{i=1}^n x_i^2-(\sum_{i=1}^n x_i)^2}=\frac{\sum_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y})}{\sum_{i=1}^n (x_i-\bar{x})^2}=\frac{\sum_{i=1}^n (x_i-\bar{x})Y_i}{\sum_{i=1}^n (x_i-\bar{x})^2} = r\times \frac{s_Y}{s_x}\)
  • Variance: parameter \(\sigma^2\), estimator \(s^2=MS(Res)=SS(Res)/(n-2)\)

where

  • Sample mean of \(x\): \(\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\)
  • Sample mean of \(Y\): \(\bar{Y}=\frac{1}{n}\sum_{i=1}^n Y_i\)
  • Sample standard deviation of \(x\): \(s_x=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}\)
  • Sample standard deviation of \(Y\): \(s_Y=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (Y_i-\bar{Y})^2}\)
  • Correlation coefficient: \(r=\frac{1}{n-1}\sum_{i=1}^n(\frac{(x_i-\bar{x})}{s_x})(\frac{(Y_i-\bar{Y})}{s_Y})\)
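As a quick check, here is a minimal R sketch (simulated data, so all names and true coefficient values are illustrative) that computes \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(s^2\) from the formulas above and compares them to lm():

```r
# Closed-form SLR estimates vs. lm() (simulated, illustrative data)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
Y <- 2 + 0.5 * x + rnorm(n)                  # true beta0 = 2, beta1 = 0.5

b1_hat <- sum((x - mean(x)) * (Y - mean(Y))) / sum((x - mean(x))^2)
b0_hat <- mean(Y) - b1_hat * mean(x)
s2     <- sum((Y - (b0_hat + b1_hat * x))^2) / (n - 2)   # MS(Res)

c(b0_hat, b1_hat);  coef(lm(Y ~ x))          # manual estimates vs. lm()
s2;  summary(lm(Y ~ x))$sigma^2              # manual s^2 vs. lm()
```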

1.3.1 Properties of Estimators

Slope:

\[\hat{\beta}_1 \sim Normal(\beta_1, \frac{1}{\sum_{i=1}^n (x_i-\bar{x})^2} \sigma^2)\]

Intercept:

\[\hat{\beta}_0 \sim Normal(\beta_0, (\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n (x_i-\bar{x})^2}) \sigma^2)\]
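The sampling distribution of \(\hat{\beta}_1\) can be checked by simulation; a rough R sketch with a fixed, simulated design (true values are illustrative):

```r
# Monte Carlo check of Var(beta1-hat) = sigma^2 / Sxx (illustrative true values)
set.seed(8)
n   <- 25
x   <- runif(n, 0, 10)                       # fixed design
Sxx <- sum((x - mean(x))^2)

b1_sims <- replicate(5000, {
  Y <- 2 + 0.5 * x + rnorm(n, sd = 2)        # beta1 = 0.5, sigma = 2
  coef(lm(Y ~ x))[2]
})

c(mean(b1_sims), var(b1_sims))               # ~ 0.5 and ~ sigma^2 / Sxx
4 / Sxx                                      # theoretical variance (sigma^2 = 4)
```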

1.3.2 Properties of Residuals

The \(i^{th}\) residual is given by \[e_i=Y_i-\hat{Y}_i\]

Key Terms
  • Residual sum of squares: \(SS(Res)=\sum_{i=1}^n e_i^2=\sum_{i=1}^n (Y_i -\hat{Y}_i)^2\)
  • Residual mean squared error: \(MS(Res)=\frac{SS(Res)}{n-2}\)
Residual restrictions:
  • \(\sum_{i=1}^n e_i = \sum_{i=1}^n (Y_i-\hat{Y}_i)=0\)
  • \(\sum_{i=1}^n x_ie_i=\sum_{i=1}^n x_i(Y_i-\hat{Y}_i)=0\)
Properties:
  • \(E[SS(Res)]=(n-2)\sigma^2\)
    • \(E[MS(Res)]=\sigma^2\), so \(MS(Res)\) is an unbiased estimator of \(\sigma^2\)
  • \(SS(Res)/\sigma^2\) has a \(\chi^2\) distribution with \(n-2\) degrees of freedom
  • \(SS(Res)\) is independent of both the least squares estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
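A short R sketch (simulated, illustrative data) showing the residual restrictions and the \(MS(Res)\) estimate numerically:

```r
# Residual restrictions and MS(Res) (simulated, illustrative data)
set.seed(2)
x <- runif(40)
Y <- 1 + 3 * x + rnorm(40)
fit <- lm(Y ~ x)
e <- resid(fit)

sum(e)                        # ~ 0 (first restriction)
sum(x * e)                    # ~ 0 (second restriction)
sum(e^2) / fit$df.residual    # MS(Res); equals summary(fit)$sigma^2
```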

1.4 Important Distributions for Regression

1.4.1 Normal: \(X\sim Normal(\mu, \sigma^2)\)

Parameters:

  • \(\mu\): Mean
  • \(\sigma^2\) : Variance

Properties:

  • Expected value, \(E[X]=\mu\)
  • Variance, \(V[X]=\sigma^2\)
  • Centered around the mean
  • Symmetric
Standard Normal: \(Z\sim Normal(\mu=0, \sigma^2=1)\)

1.4.2 Chi-Squared: \(V\sim \chi^2(df=\nu)\)

Parameters:

  • \(\nu\): Degrees of freedom

Properties:

  • Expected value, \(E[X]=\nu\)
  • Variance, \(V[X]=2 \nu\)
  • The square of a standard normal random variable is chi-squared with one degree of freedom: \(Z^2 \sim \chi^2(df=1)\)
  • The sum of independent \(\chi^2\)’s is a \(\chi^2\) with the degrees of freedom summed

1.4.3 Student’s T: \(T\sim t_{df=\nu}\)

Parameters:

  • \(\nu\): Degrees of freedom

Properties:

  • Expected value, \(E[X]=0\)
  • Variance: Don’t worry about it
  • Converges to the standard normal distribution as \(\nu \rightarrow \infty\)
  • Ratio of a standard normal to the square root of an independent chi-squared divided by its degrees of freedom, \(T=\frac{Z}{\sqrt{V/\nu}}\), with \(Z\) and \(V\) independent

1.4.4 F-distribution: \(F\sim F(\nu_1, \nu_2)\)

Parameters:

  • \(\nu_1\): Numerator degrees of freedom
  • \(\nu_2\): Denominator degrees of freedom

Properties:

  • The ratio of two independent chi-squared rvs, each divided by its degrees of freedom: \(F=\frac{U/\nu_1}{V/\nu_2}\), where \(U\sim \chi^2(\nu_1)\) and \(V\sim \chi^2(\nu_2)\)
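These relationships can be sanity-checked with a quick Monte Carlo sketch in R (the degrees of freedom chosen here are purely illustrative):

```r
# Monte Carlo check of the chi-squared / t / F relationships
set.seed(3)
B <- 1e5
Z <- rnorm(B)                 # standard normal
U <- rchisq(B, df = 3)        # chi-squared draws, independent of Z and V
V <- rchisq(B, df = 10)

mean(Z^2 <= qchisq(0.9, df = 1))              # ~ 0.9: Z^2 is chi-squared(1)
mean(Z / sqrt(V / 10) <= qt(0.9, df = 10))    # ~ 0.9: t with 10 df
mean((U / 3) / (V / 10) <= qf(0.9, 3, 10))    # ~ 0.9: F(3, 10)
```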

1.5 Hypothesis Tests

1.5.1 Test for Slope

Hypotheses:
  • Two-sided (nondirectional): \(H_0:\beta_1=\beta_{1,0}\) vs \(H_A:\beta_1\neq \beta_{1,0}\)
  • Right-sided/One-sided Upper: \(H_0:\beta_1=\beta_{1,0}\) vs \(H_A:\beta_1> \beta_{1,0}\)
  • Left-sided/One-sided Lower: \(H_0:\beta_1=\beta_{1,0}\) vs \(H_A:\beta_1< \beta_{1,0}\)
Reference distribution:

Student t-distribution with \(n-2\) degrees of freedom

Test Statistic

\[t=\frac{(\hat{\beta}_1-\beta_{1,0})\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}}{\sqrt{SS(Res)/(n-2)}}=\frac{(\hat{\beta}_1-\beta_{1,0})\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}}{\sqrt{MS(Res)}}\]

We can simplify this to say: \[t=\frac{\hat{\beta}_1-\beta_{1,0}}{SE(\hat{\beta}_1)}\]

Where \(SE(\hat{\beta}_1)\) is the standard error for the slope estimator and is given by \[SE(\hat{\beta}_1)=\frac{\sqrt{SS(Res)/(n-2)}}{\sqrt{\sum_{i=1}^n (x_i -\bar{x})^2}}=\frac{\sqrt{MS(Res)}}{\sqrt{\sum_{i=1}^n (x_i -\bar{x})^2}}=\frac{s}{\sqrt{\sum_{i=1}^n (x_i -\bar{x})^2}}\]

Typically, we use \(\beta_{1,0}=0\) to test for the significance of a relationship between our response (\(Y\)) and explanatory (\(x\)) variables.

Thus, the test statistic is:

\[t=\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\]

P-values
  • Two-sided (nondirectional): \(2\times Pr(t_{n-2} \geq |\text{test stat}|)\)
  • Right-sided/One-sided Upper: \(Pr(t_{n-2} \geq \text{test stat})\)
  • Left-sided/One-sided Lower: \(Pr(t_{n-2} \leq \text{test stat})\)
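A minimal R sketch (simulated, illustrative data) computing the slope t-test by hand and comparing it to the row reported by summary(lm()):

```r
# Slope t-test by hand vs. summary(lm()) (simulated, illustrative data)
set.seed(4)
n <- 30
x <- rnorm(n)
Y <- 1 + 0.8 * x + rnorm(n)
fit <- lm(Y ~ x)

MS_res <- sum(resid(fit)^2) / (n - 2)
se_b1  <- sqrt(MS_res / sum((x - mean(x))^2))
t_stat <- coef(fit)["x"] / se_b1                              # tests beta_{1,0} = 0
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE) # two-sided p-value

c(t_stat, p_val)
summary(fit)$coefficients["x", ]   # same t value and Pr(>|t|)
```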

1.6 Confidence Intervals

1.6.1 Confidence Interval for Slope, \(\beta_1\)

\[\hat{\beta}_1 \pm t_{df=n-2, \alpha/2}^* \times SE(\hat{\beta}_1)\]

where \(SE(\hat{\beta}_1)=\frac{\sqrt{MS(Res)}}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}}\)

1.6.2 Confidence Interval for Intercept, \(\beta_0\)

\[\hat{\beta}_0 \pm t_{df=n-2, \alpha/2}^* \times SE(\hat{\beta}_0)\]

where \(SE(\hat{\beta}_0)=\sqrt{MS(Res)\times (\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n (x_i-\bar{x})^2})}\)
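In R, these intervals can be obtained with confint(); a small sketch with simulated, illustrative data:

```r
# t-based confidence intervals for the intercept and slope (simulated data)
set.seed(5)
x <- rnorm(30)
Y <- 1 + 0.8 * x + rnorm(30)
fit <- lm(Y ~ x)

confint(fit, level = 0.95)       # rows: (Intercept) and x

# Slope interval by hand: estimate +/- t* x SE
coef(fit)["x"] + c(-1, 1) * qt(0.975, df = fit$df.residual) *
  summary(fit)$coefficients["x", "Std. Error"]
```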

1.7 Prediction

Once the model is fitted, we can use it to predict the value of the response variable \(Y\) at any given value \(x=x_0\).

1.7.1 Point Estimate

We simply input \(x=x_0\) into the fitted equation \[\hat{Y}_0=\hat{\beta}_0+\hat{\beta}_1 x_0\]

1.7.2 Confidence Interval for the Mean Response

\[\mu_{Y|x_0}=E[Y|x_0]=\beta_0+\beta_1 x_0\]

The form of a \(100(1-\alpha)\%\) confidence interval for the mean response is given by

\[(\hat{\beta}_0+\hat{\beta}_1 x_0)\pm t_{df=n-2, \alpha/2}^* \times \sqrt{MS(Res)\times (\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2})}\]

1.7.3 Prediction Interval for a New Response

A confidence interval for the mean response does not reflect the bounds within which we realistically expect to observe a single new observation at \(x=x_0\). A single observation has more variability than the average of observations. Therefore, we account for that by changing the error term.

The form of a \(100(1-\alpha)\%\) prediction interval for a new response at \(x=x_0\) is given by

\[(\hat{\beta}_0+\hat{\beta}_1 x_0)\pm t_{df=n-2, \alpha/2}^* \times \sqrt{MS(Res)\times (1+\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2})}\]
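In R, both intervals come from predict() on the fitted model; a small sketch with simulated, illustrative data:

```r
# Confidence vs. prediction interval at x0 (simulated, illustrative data)
set.seed(6)
x <- runif(40, 0, 10)
Y <- 2 + 0.5 * x + rnorm(40)
fit <- lm(Y ~ x)
new <- data.frame(x = 5)                                            # x0 = 5

predict(fit, newdata = new, interval = "confidence", level = 0.95)  # mean response
predict(fit, newdata = new, interval = "prediction", level = 0.95)  # new response (wider)
```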

2. Multiple Linear Regression (MLR) Model

Let \(Y\) be a response variable which can possibly be explained by predictors \(x_1, x_2, ..., x_p\). We say there is a linear model that describes a potential relationship between \(Y\) and \(x_1, x_2, ..., x_p\) if this relationship can be expressed as:

\[Y_i=\beta_0+\beta_1 x_{i, 1}+\beta_2 x_{i, 2}+ ... +\beta_p x_{i, p}+ \varepsilon_i\]

Where \(\beta_0, \beta_1, ..., \beta_p\) are unknown regression coefficients and the errors \(\varepsilon_i\) satisfy the same iid \(Normal(0, \sigma^2)\) assumptions as in SLR, for \(i=1, 2, ..., n\).

2.1 Expressing the model with matrices

2.1.1 Notation

Response Vector

\(\textbf{Y}_{n\times 1}=\begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{bmatrix}\)

Design Matrix

\(\textbf{X}_{n\times (p+1)}=\begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p}\\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p}\\ \end{bmatrix}\)

Coefficient (Parameter) Vector

\(\boldsymbol \beta_{(p+1)\times 1}=\begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{bmatrix}\)

Error Vector

\(\boldsymbol \varepsilon_{n\times 1}=\begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix}\)

Thus we can express the model using matrices as

\[\textbf{Y}_{n\times 1}=\textbf{X}_{n\times (p+1)}\boldsymbol \beta_{(p+1)\times 1}+\boldsymbol \varepsilon_{n\times 1}\]

2.1.2 Least Squares Estimator

The least squares estimator of \(\beta\) is given by

\[\hat{\boldsymbol \beta}=(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{Y}\]

Thus the least squares regression equation is given by:

\[\hat{\textbf{Y}}=\textbf{X}\hat{\boldsymbol \beta}=\textbf{X}(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{Y}\]
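A short R sketch with simulated predictors (names and true coefficients are illustrative) computing \(\hat{\boldsymbol \beta}\) from the matrix formula and comparing it to coef(lm()):

```r
# Matrix least squares vs. lm() (simulated, illustrative data)
set.seed(7)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
Y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                          # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y   # (X'X)^{-1} X'Y
cbind(beta_hat, coef(lm(Y ~ x1 + x2)))         # the two columns should agree
```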

2.2 The Hat (Projection) Matrix

The \(n\times n\) matrix, known as the hat matrix, is given by

\[\textbf{H}=\textbf{X}(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\]

2.2.1 Properties

The least squares predicted values \(\hat{\textbf{Y}}=\textbf{H}\textbf{Y}\) can be considered as the image of \(\textbf{Y}\) under the projection \(\textbf{H}\)

The hat matrix is:

  • Idempotent (\(\textbf{H}\textbf{H}=\textbf{H}^2=\textbf{H}\))
  • Symmetric (\(\textbf{H}=\textbf{H}^T\))

The deviation between \({\textbf{Y}}\) and \(\hat{\textbf{Y}}\) is the vector of residuals.

\[{\textbf{e}}=\textbf{Y}-\hat{\textbf{Y}}=\textbf{Y}-\textbf{H}\textbf{Y}=(\textbf{I}-\textbf{H})\textbf{Y}\]

  • It can be shown that \((\textbf{I}-\textbf{H})^2=\textbf{I}-\textbf{H}\) and \((\textbf{I}-\textbf{H})\textbf{H}=0\). Therefore, the matrix \((\textbf{I}-\textbf{H})\) is also idempotent and is a projection orthogonal to the hat matrix \(\textbf{H}\).
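Continuing the simulated \(\textbf{X}\) and \(\textbf{Y}\) from the previous sketch, a quick numerical check of these properties:

```r
# Hat-matrix checks (X, Y, n from the previous sketch)
H <- X %*% solve(t(X) %*% X) %*% t(X)

max(abs(H %*% H - H))            # ~ 0: idempotent
max(abs(H - t(H)))               # ~ 0: symmetric
max(abs((diag(n) - H) %*% H))    # ~ 0: (I - H) is orthogonal to H
max(abs(drop(H %*% Y) - fitted(lm(Y ~ x1 + x2))))   # ~ 0: H Y gives the fitted values
```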

2.3 Foundations of MLR Inference

2.3.1 Multivariate Normal

A random vector \(\textbf{Y}^T=[Y_1, ..., Y_n]\) is said to have a multivariate normal distribution with mean parameter vector \(\boldsymbol \mu^T=[\mu_1, ..., \mu_n]\) and variance-covariance matrix \(\boldsymbol \Sigma\), if it has the joint probability density function

\[f(Y_1, ..., Y_n)=f(\textbf{Y})=\left(\frac{1}{\sqrt{2 \pi}}\right)^n \frac{1}{\det(\boldsymbol \Sigma)^{1/2}} e^{-\frac{1}{2}(\textbf{Y}-\boldsymbol \mu)^T \boldsymbol \Sigma^{-1}(\textbf{Y}-\boldsymbol \mu)}\]

Notation: \[\textbf{Y}\sim MVN(\boldsymbol \mu, \boldsymbol \Sigma)\]

\(MVN\): Multivariate Normal

Variance-Covariance Matrix, \(\boldsymbol \Sigma\)

The \((i,j)^{th}\) entry of the variance-covariance matrix \(\boldsymbol \Sigma\) is

\[\sigma_{i,j}=Cov(Y_i, Y_j)=\rho_{i,j}\sigma_i \sigma_j\]

where \(\rho_{i,j}\) is the correlation coefficient between \(Y_i\) and \(Y_j\)

\(\boldsymbol \Sigma=\begin{bmatrix} \sigma_1^2 & \sigma_{1,2} & \sigma_{1,3} & \cdots & \sigma_{1,n}\\ \sigma_{2,1} & \sigma_2^2 & \sigma_{2,3} & \cdots & \sigma_{2,n}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \sigma_{n,1} & \sigma_{n,2} & \sigma_{n,3} & \cdots & \sigma_{n}^2\\ \end{bmatrix}\)

Note that:

  • \(Var(Y_i)=Cov(Y_i, Y_i)=\sigma_{i,i}=\sigma_i^2\)
  • \(Cov(Y_i, Y_j)=\sigma_{i,j}=\sigma_{j,i}=Cov(Y_j, Y_i)\) (Symmetric)
Properties

Let random vector \(\textbf{Y}^T=[Y_1, ..., Y_n]\) have a multivariate normal distribution with mean vector \(\boldsymbol \mu\) and variance-covariance matrix \(\boldsymbol \Sigma\). Let \(\textbf{A}\) be a \(p\times n\) matrix.

Then

  • Expectation: \(E[\textbf{AY}]=\textbf{A} \boldsymbol \mu\)

  • Variance: \(Cov[\textbf{AY}]=\textbf{A} Cov(\textbf{Y})\textbf{A}^T=\textbf{A} \boldsymbol \Sigma \textbf{A}^T\)

Thus the distribution of \(\textbf{AY}\) is

\[\textbf{AY}\sim MVN(\textbf{A} \boldsymbol \mu, \textbf{A} \boldsymbol \Sigma \textbf{A}^T)\]
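A rough simulation check of these two properties. This sketch assumes the MASS package (for mvrnorm); the mean vector, covariance matrix, and \(\textbf{A}\) used here are arbitrary illustrative choices:

```r
# Empirical check of E[AY] = A mu and Cov[AY] = A Sigma A' (assumes the MASS package)
library(MASS)
set.seed(9)
mu    <- c(1, 2, 3)
Sigma <- matrix(c(2.0, 0.5, 0.3,
                  0.5, 1.0, 0.2,
                  0.3, 0.2, 1.5), nrow = 3)
A <- matrix(c(1, 1,  0,
              0, 1, -1), nrow = 2, byrow = TRUE)

Ysim <- mvrnorm(1e5, mu, Sigma)   # each row is one draw of Y
AY   <- Ysim %*% t(A)             # each row is A %*% Y for that draw

colMeans(AY);  drop(A %*% mu)             # empirical vs. theoretical mean
cov(AY);       A %*% Sigma %*% t(A)       # empirical vs. theoretical covariance
```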

2.3.2 Individual Tests

Since the least squares estimator satisfies

\[\hat{\boldsymbol \beta} \sim MVN(\boldsymbol \beta, \sigma^2(\textbf{X}^T\textbf{X})^{-1})\]

the distribution of an individual \(\hat{\beta}_j\) is given by

\[\hat{\beta}_j \sim N(\beta_j, \sigma^2 C_{j,j})\]

where \(C_{j,j}\) is the \(j^{th}\) diagonal entry of the \((\textbf{X}^T\textbf{X})^{-1}\) matrix.

As before we estimate \(\sigma^2\) with \(MS(Res)\), which under MLR is given by \(\hat{\sigma}^2=s^2=MS(Res)=SS(Res)/(n-p-1)\).

Thus, the standard error of \(\hat{\beta}_j\) is \[se(\hat{\beta}_j)=\hat{\sigma}\sqrt{C_{j,j}}\]

Then the test for an individual slope is

\[H_0: \beta_j=0\]

\[H_A: \beta_j \neq 0\]

The test statistic is

\[t_j=\frac{\hat{\beta}_j}{se(\hat{\beta}_j)}=\frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{C_{j,j}}}\]

The confidence interval is

\[\hat{\beta}_j \pm t^*_{df=n-p-1, \alpha/2} \times se(\hat{\beta}_j)\]
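A minimal R sketch with simulated data (all names and true coefficient values are illustrative) showing where these quantities appear in lm() output:

```r
# Individual coefficient tests and intervals in an MLR fit (simulated, illustrative data)
set.seed(10)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
Y  <- 1 + 2 * x1 + 0 * x2 - 1 * x3 + rnorm(n)
fit <- lm(Y ~ x1 + x2 + x3)

summary(fit)$coefficients        # Estimate, Std. Error, t value, Pr(>|t|)
confint(fit)                     # t-based intervals with df = n - p - 1

# se(beta_j) by hand from C = (X'X)^{-1}
X <- model.matrix(fit)
C <- solve(t(X) %*% X)
sqrt(summary(fit)$sigma^2 * diag(C))   # matches the Std. Error column
```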

Multiple Comparisons

So far we have considered individual hypothesis tests and confidence intervals for regression coefficients; however, performing many tests, each at level \(\alpha\), inflates the overall (family-wise) Type I error rate.

If we want to construct “simultaneous” hypothesis tests and/or confidence intervals for \(\beta_1, \beta_2, ..., \beta_p\) we have to consider adjustments for multiple comparisons.

Bonferroni Correction

Adjust the per-test Type I error rate to \(\alpha/k\), where \(k\) is the number of simultaneous tests or intervals.
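A small R sketch of the Bonferroni adjustment, continuing the simulated MLR fit above (taking \(k\) to be the number of slopes tested is an illustrative choice):

```r
# Bonferroni adjustment for k simultaneous slope tests / intervals (fit from above)
p_vals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]   # drop the intercept row
p.adjust(p_vals, method = "bonferroni")               # adjusted p-values, compare to alpha

k <- length(p_vals)
confint(fit, level = 1 - 0.05 / k)                    # simultaneous (Bonferroni) 95% intervals
```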

3. ANOVA

The analysis of variance (ANOVA) is based on the following decomposition of the sources of variation:

\[\sum_{i=1}^n(Y_i-\bar{Y})^2=\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2+\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\]

Where we have the following sums of squares:

  • Total: \(SS(Tot)=\sum_{i=1}^n(Y_i-\bar{Y})^2\)
  • Regression: \(SS(Reg)=\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2\)
  • Residual: \(SS(Res)=\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\)

so that \(SS(Tot)=SS(Reg)+SS(Res)\).

3.0.1 Quadratic Forms

Additional notation to define:

  • \(\textbf{1}_{n\times 1}\): Column vector of ones
  • \(\textbf{J}_{n\times n}=\textbf{1}\textbf{1}^T\): Matrix of ones
  • \(\textbf{I}_{n \times n}\): Identity matrix

Sum of Squares

  • \(SS(Total)=\textbf{Y}^T(\textbf{I}-\frac{1}{n}\textbf{J})\textbf{Y}\)
  • \(SS(Reg)=\textbf{Y}^T(\textbf{H}-\frac{1}{n}\textbf{J})\textbf{Y}\)
  • \(SS(Res)=\textbf{Y}^T(\textbf{I}-\textbf{H})\textbf{Y}\)
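Continuing the simulated MLR fit from the earlier sketch, these quadratic forms can be computed directly and checked against anova():

```r
# Quadratic-form sums of squares (continuing fit, Y, and n from the MLR sketch above)
X   <- model.matrix(fit)
H   <- X %*% solve(t(X) %*% X) %*% t(X)
J_n <- matrix(1, n, n)                  # J = 1 1'
I_n <- diag(n)

SS_tot <- drop(t(Y) %*% (I_n - J_n / n) %*% Y)
SS_reg <- drop(t(Y) %*% (H   - J_n / n) %*% Y)
SS_res <- drop(t(Y) %*% (I_n - H)       %*% Y)

c(SS_reg + SS_res, SS_tot)        # decomposition: SS(Reg) + SS(Res) = SS(Tot)
sum(anova(fit)[, "Sum Sq"])       # sequential term SS + residual SS, also equals SS(Tot)
```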

3.1 ANOVA Table

  • Regression: DF \(p\); Sum of Squares \(SS(Reg)\); Mean Square \(MS(Reg)=SS(Reg)/p\); F-value \(MS(Reg)/MS(Res)\); P-value pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)
  • Residual: DF \(n-p-1\); Sum of Squares \(SS(Res)\); Mean Square \(MS(Res)=SS(Res)/(n-p-1)\)
  • Total: DF \(n-1\); Sum of Squares \(SS(Tot)\)

3.2 R-squared

The proportion of variability described by the model:

\[R^2=\frac{SS(Reg)}{SS(Tot)}=1-\frac{SS(Res)}{SS(Tot)}\]
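Continuing the previous sketch, \(R^2\) can be recovered from either form and matches summary(lm()):

```r
# R-squared two ways (continuing SS_reg, SS_res, SS_tot, and fit from above)
SS_reg / SS_tot
1 - SS_res / SS_tot
summary(fit)$r.squared   # same value
```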

3.3 Distributions of Sum of Squares

  • \(\frac{SS(Res)}{\sigma^2} \sim \chi^2 (df=n-p-1)\)
  • \(\frac{SS(Reg)}{\sigma^2} \sim \chi^2 (df=p)\) under \(H_0: \beta_1=\beta_2=...=\beta_p=0\)
  • \(\frac{SS(Tot)}{\sigma^2} \sim \chi^2 (df=n-1)\) under the same null hypothesis

3.4 F-tests

3.4.1 Global F-test

Hypotheses:

\[H_0: \beta_1=\beta_2=...=\beta_p=0\]

\[H_A: \text{at least one } \beta_j\neq 0\]

Test Statistic:

\[F=\frac{\frac{SS(Reg)}{p}}{\frac{SS(Res)}{n-p-1}}=\frac{MS(Reg)}{MS(Res)}\]

Reference Distribution: \(F_{df1=p, df2=n-p-1}\)
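Continuing the quantities from the previous sketches, the global F statistic can be computed by hand and compared against the one reported by summary(lm()):

```r
# Global F-test by hand (continuing SS_reg, SS_res, and fit from the sketches above)
p      <- length(coef(fit)) - 1        # number of slopes
MS_reg <- SS_reg / p
MS_res <- SS_res / fit$df.residual     # df = n - p - 1
F_val  <- MS_reg / MS_res

pf(F_val, df1 = p, df2 = fit$df.residual, lower.tail = FALSE)   # p-value
summary(fit)$fstatistic                                         # same F, df1, df2
```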

3.4.2 General Linear Hypothesis

Hypotheses

\[H_0: \textbf{k}^T \boldsymbol \beta = \textbf{m}\]

\[H_A: \textbf{k}^T \boldsymbol \beta \neq \textbf{m}\]

Where

  • \(\textbf{k}^T\) is an \(r \times (p+1)\) matrix of rank \(r\) (each row specifies a linear combination of the coefficients)
  • \(\textbf{m}\) is an \(r \times 1\) vector of constants

The sampling distribution of \(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m}\) is

\[(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m}) \sim MVN(\textbf{k}^T \boldsymbol \beta - \textbf{m}, \sigma^2 \textbf{k}^T(\textbf{X}^T\textbf{X})^{-1}\textbf{k})\]

and under the null hypothesis the mean vector is \(\textbf{0}\).

Then, under the null, we have the following distribution

\[(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})^T[\sigma^2 \textbf{k}^T(\textbf{X}^T\textbf{X})^{-1}\textbf{k}]^{-1}(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})\sim \chi^2_{df=r}\]

which yields the following quadratic form

\[\text{Q}=(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})^T[\textbf{k}^T(\textbf{X}^T\textbf{X})^{-1}\textbf{k}]^{-1}(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})\]

The F-test statistic is then given by

\[F=\frac{\textbf{Q}/r}{s^2}\sim F_{df1=r, df2=n-p-1}\]

The general linear hypothesis can be used to compare full and reduced models, which is useful for variable selection.
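In R, this full-versus-reduced comparison can be run with anova() on two nested lm fits, which carries out the corresponding partial F-test. A sketch continuing the simulated x1, x2, x3, and Y from the earlier MLR sketch:

```r
# Partial F-test of H0: beta_2 = beta_3 = 0 (nested models; simulated data from above)
reduced <- lm(Y ~ x1)
full    <- lm(Y ~ x1 + x2 + x3)
anova(reduced, full)   # F statistic with df1 = 2, df2 = n - p - 1
```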

3.4.3 Variable Selection

In this class we discussed:

  • Forward selection (starts with an empty/null model)
  • Backward elimination (starts with a saturated model)
  • Best subset selection
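As one common implementation of these ideas, R's step() performs AIC-based forward or backward selection, and best subsets is available in add-on packages. A sketch using the simulated variables from above (all names are illustrative):

```r
# AIC-based stepwise selection with step() (one common implementation; simulated data from above)
null_mod <- lm(Y ~ 1)
full_mod <- lm(Y ~ x1 + x2 + x3)

step(null_mod, scope = ~ x1 + x2 + x3, direction = "forward")   # forward selection
step(full_mod, direction = "backward")                          # backward elimination
# Best subsets: e.g. regsubsets() from the leaps package, if installed
```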