Let \((x_1, Y_1), (x_2, Y_2), ..., (x_n, Y_n)\) be \(n\) bivariate data points observed from a simple linear model \[Y_i=\beta_0+\beta_1 x_i+\varepsilon_i, i=1, 2, ..., n\]
\[\varepsilon_i \sim Normal(0, \sigma^2)\]
The errors are independent and identically distributed (iid).
The least squares estimators, \(\hat{\beta}_0\) and \(\hat{\beta}_1\), are found by minimizing \(SS(Res)=\sum_{i=1}^n (Y_i-\hat{Y}_i)^2\)
Type | Parameter | Estimator |
---|---|---|
Y-intercept | \(\beta_0\) | \(\hat{\beta}_0=\bar{Y}-\hat{\beta}_1 \bar{x}\) |
Slope | \(\beta_1\) | \(\hat{\beta}_1 =\frac{n\sum_{i=1}^n x_iY_i-\sum_{i=1}^n x_i \sum_{i=1}^n Y_i}{n\sum_{i=1}^n x_i^2-(\sum_{i=1}^n x_i)^2}=\frac{\sum_{i=1}^n (x_i-\bar{x})(Y_i-\bar{Y})}{\sum_{i=1}^n (x_i-\bar{x})^2}=\frac{\sum_{i=1}^n (x_i-\bar{x})Y_i}{\sum_{i=1}^n (x_i-\bar{x})^2} = r\times \frac{s_Y}{s_x}\) |
Variance | \(\sigma^2\) | \(s^2=MS(Res)=SS(Res)/(n-2)\) |
where
\[\hat{\beta}_1 \sim Normal(\beta_1, \frac{1}{\sum_{i=1}^n (x_i-\bar{x})^2} \sigma^2)\]
\[\hat{\beta}_0 \sim Normal(\beta_0, (\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n (x_i-\bar{x})^2}) \sigma^2)\]
The \(i^{th}\) residual is given by \[\hat{e}_i=Y_i-\hat{Y}_i\]
\(E[SS(Res)]=(n-2)\sigma^2\)
\(SS(Res)\) is independent of both the least squares estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
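As a quick numerical check, here is a minimal sketch (simulated data; object names such as `b0`, `b1`, and `s2` are illustrative) that computes \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(MS(Res)\) from the formulas above and compares them with `lm()`.

```r
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
Y <- 2 + 0.5 * x + rnorm(n, sd = 1)      # simulated data: beta0 = 2, beta1 = 0.5

Sxx <- sum((x - mean(x))^2)
b1  <- sum((x - mean(x)) * (Y - mean(Y))) / Sxx   # slope estimate
b0  <- mean(Y) - b1 * mean(x)                     # intercept estimate
res <- Y - (b0 + b1 * x)                          # residuals
s2  <- sum(res^2) / (n - 2)                       # MS(Res), estimates sigma^2

fit <- lm(Y ~ x)
c(b0, b1); coef(fit)                 # hand-computed estimates vs. lm()
c(s2, summary(fit)$sigma^2)          # both estimate sigma^2
```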
Under \(H_0: \beta_1=\beta_{1,0}\), the test statistic follows a Student t-distribution with \(n-2\) degrees of freedom:
\[t=\frac{(\hat{\beta}_1-\beta_{1,0})\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}}{\sqrt{SS(Res)/(n-2)}}=\frac{(\hat{\beta}_1-\beta_{1,0})\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2}}{\sqrt{MS(Res)}}\]
We can simplify this to say: \[t=\frac{\hat{\beta}_1-\beta_{1,0}}{SE(\hat{\beta}_1)}\]
Where \(SE(\hat{\beta}_1)\) is the standard error for the slope estimator and is given by \[SE(\hat{\beta}_1)=\frac{\sqrt{SS(Res)/(n-2)}}{\sqrt{\sum_{i=1}^n (x_i -\bar{x})^2}}=\frac{\sqrt{MS(Res)}}{\sqrt{\sum_{i=1}^n (x_i -\bar{x})^2}}=\frac{s}{\sqrt{\sum_{i=1}^n (x_i -\bar{x})^2}}\]
Typically, we use \(\beta_{1,0}=0\) to test for the significance of a relationship between our response (\(Y\)) and explanatory (\(x\)) variables.
Thus, the test statistic is:
\[t=\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}\]
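Continuing the illustrative simulation above, the test statistic and its p-value can be computed by hand and checked against the `x` row of the `lm()` summary.

```r
se_b1 <- sqrt(s2 / Sxx)                            # SE(beta1_hat) = s / sqrt(Sxx)
t_b1  <- b1 / se_b1                                # test statistic for H0: beta1 = 0
p_b1  <- 2 * pt(abs(t_b1), df = n - 2, lower.tail = FALSE)

c(se_b1, t_b1, p_b1)
summary(fit)$coefficients["x", ]                   # Estimate, Std. Error, t value, Pr(>|t|)
```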
Once the model is fitted we can use it to make predictions on the value of the response variable \(Y\), for any given value of \(x=x_0\).
We simply input \(x=x_0\) into the fitted equation: \[\hat{\beta}_0+\hat{\beta}_1 x_0\]
This also serves as our estimate of the mean response at \(x_0\), \[\mu_{Y|x_0}=E[Y|x_0]=\beta_0+\beta_1 x_0\]
The form of a \(100(1-\alpha)\%\) confidence interval for the mean response is given by
\[(\hat{\beta}_0+\hat{\beta}_1 x_0)\pm t_{df=n-2, \alpha/2}^* \times \sqrt{MS(Res)\times (\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2})}\]
A confidence interval for the mean response does not reflect the bounds within which we realistically expect to observe a single new observation at \(x=x_0\). A single observation has more variability than the average of observations, so we account for this extra variability in the error term.
The form of a \(100(1-\alpha)\%\) prediction interval for a single new observation at \(x=x_0\) is given by
\[(\hat{\beta}_0+\hat{\beta}_1 x_0)\pm t_{df=n-2, \alpha/2}^* \times \sqrt{MS(Res)\times (1+\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^n (x_i-\bar{x})^2})}\]
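In R, `predict()` returns both intervals. The sketch below (continuing the same illustrative fit; `x0` is an arbitrary value) also reproduces the confidence interval for the mean response from the formula above.

```r
x0 <- 5
predict(fit, newdata = data.frame(x = x0), interval = "confidence", level = 0.95)
predict(fit, newdata = data.frame(x = x0), interval = "prediction", level = 0.95)

# Confidence interval for the mean response, computed from the formula
yhat0 <- b0 + b1 * x0
tstar <- qt(0.975, df = n - 2)
me    <- tstar * sqrt(s2 * (1 / n + (x0 - mean(x))^2 / Sxx))
c(yhat0 - me, yhat0 + me)
```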
Let \(Y\) be a response variable which can possibly be explained by predictors \(x_1, x_2, ..., x_p\). We say there is a linear model that describes a potential relationship between \(Y\) and \(x_1, x_2, ..., x_p\) if this relationship can be expressed as:
\[Y_i=\beta_0+\beta_1 x_{i, 1}+\beta_2 x_{i, 2}+ ... +\beta_p x_{i, p}+ \varepsilon_i\]
Where
\(\textbf{Y}_{n\times 1}=\begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{bmatrix}\)
\(\textbf{X}_{n\times (p+1)}=\begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p}\\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{bmatrix}\)
\(\boldsymbol \beta_{(p+1)\times 1}=\begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{bmatrix}\)
\(\boldsymbol \varepsilon_{n\times 1}=\begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix}\)
Thus we can express the model using matrices as
\[\textbf{Y}_{n\times 1}=\textbf{X}_{n\times (p+1)}\boldsymbol \beta_{(p+1)\times 1}+\boldsymbol \varepsilon_{n\times 1}\]
The least squares estimator of \(\boldsymbol \beta\) is given by
\[\hat{\boldsymbol \beta}=(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{Y}\]
Thus the least squares regression equation is given by:
\[\hat{\textbf{Y}}=\textbf{X}\hat{\boldsymbol \beta}=\textbf{X}(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{Y}\]
The \(n\times n\) matrix, known as the hat matrix, is given by
\[\textbf{H}=\textbf{X}(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\]
The least squares predicted values \(\hat{\textbf{Y}}=\textbf{H}\textbf{Y}\) can be considered as the image of \(\textbf{Y}\) under the projection \(\textbf{H}\)
The hat matrix is symmetric (\(\textbf{H}^T=\textbf{H}\)) and idempotent (\(\textbf{H}\textbf{H}=\textbf{H}\)), as expected of a projection matrix.
The deviation between \({\textbf{Y}}\) and \(\hat{\textbf{Y}}\) is the vector of residuals.
\[{\textbf{e}}=\textbf{Y}-\hat{\textbf{Y}}=\textbf{Y}-\textbf{H}\textbf{Y}=(\textbf{I}-\textbf{H})\textbf{Y}\]
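These matrix formulas can be verified numerically. The sketch below (simulated data with two illustrative predictors; object names such as `fit_mlr` are not from the notes) builds \(\textbf{X}\), computes \(\hat{\boldsymbol \beta}\), \(\textbf{H}\), and the residual vector, and checks them against `lm()`.

```r
set.seed(2)
n  <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
Y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)            # simulated response

X        <- cbind(1, x1, x2)                    # n x (p+1) design matrix
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)       # (X'X)^{-1} X'Y
H        <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix
e        <- Y - H %*% Y                         # residuals, (I - H) Y

fit_mlr <- lm(Y ~ x1 + x2)
cbind(beta_hat, coef(fit_mlr))              # estimates match
max(abs(e - resid(fit_mlr)))                # ~ 0
max(abs(diag(H) - hatvalues(fit_mlr)))      # diagonal of H matches the leverages
```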
A random vector \(\textbf{Y}^T=[Y_1, ..., Y_n]\) is said to have a multivariate normal distribution with mean parameter vector \(\boldsymbol \mu^T=[\mu_1, ..., \mu_n]\) and variance-covariance matrix \(\boldsymbol \Sigma\) , if it has the joint probability density function
\[f(Y_1, ..., Y_n)=f(\textbf{Y})=\left(\frac{1}{\sqrt{2 \pi}}\right)^n \frac{1}{\det(\boldsymbol \Sigma)^{1/2}} e^{-\frac{1}{2}(\textbf{Y}-\boldsymbol \mu)^T \boldsymbol \Sigma^{-1}(\textbf{Y}-\boldsymbol \mu)}\]
Notation: \[\textbf{Y}\sim MVN(\boldsymbol \mu, \boldsymbol \Sigma)\]
\(MVN\): Multivariate Normal
The \((i,j)^{th}\) entry of the variance-covariance matrix \(\boldsymbol \Sigma\) is
\[\sigma_{i,j}=Cov(Y_i, Y_j)=\rho_{i,j}\sigma_i \sigma_j\]
where \(\rho_{i,j}\) is the correlation coefficient between \(Y_i\) and \(Y_j\).
\(\boldsymbol \Sigma=\begin{bmatrix} \sigma_1^2 & \sigma_{1,2} & \sigma_{1,3} & \cdots & \sigma_{1,n}\\ \sigma_{2,1} & \sigma_2^2 & \sigma_{2,3} & \cdots & \sigma_{2,n}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \sigma_{n,1} & \sigma_{n,2} & \sigma_{n,3} & \cdots & \sigma_n^2 \end{bmatrix}\)
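As a sanity check on the density formula, the illustrative helper below (`dmvn` is not a library function) evaluates it directly; in the univariate case it should reduce to `dnorm()`.

```r
# Multivariate normal density from the formula above (illustrative helper)
dmvn <- function(y, mu, Sigma) {
  k    <- length(mu)
  quad <- t(y - mu) %*% solve(Sigma) %*% (y - mu)
  as.numeric((2 * pi)^(-k / 2) * det(Sigma)^(-1 / 2) * exp(-quad / 2))
}

# Univariate check: a N(0, 4) density evaluated at 1.3
dmvn(1.3, mu = 0, Sigma = matrix(4))
dnorm(1.3, mean = 0, sd = 2)            # same value
```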
Note that if the \(Y_i\) are mutually independent, the off-diagonal covariances are all zero; under the regression model assumptions, \(\boldsymbol \Sigma=\sigma^2 \textbf{I}\).
Let random vector \(\textbf{Y}^T=[Y_1, ..., Y_n]\) have a multivariate normal distribution with mean vector \(\boldsymbol \mu\) and variance-covariance matrix \(\boldsymbol \Sigma\). Let \(\textbf{A}\) be a \(p\times n\) matrix.
Then
Expectation: \(E[\textbf{AY}]=\textbf{A} \boldsymbol \mu\)
Variance: \(Cov[\textbf{AY}]=\textbf{A} Cov(\textbf{Y})\textbf{A}^T=\textbf{A} \boldsymbol \Sigma \textbf{A}^T\)
Thus the distribution of \(\textbf{AY}\) is
\[\textbf{AY}\sim MVN(\textbf{A} \boldsymbol \mu, \textbf{A} \boldsymbol \Sigma \textbf{A}^T)\]
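A quick simulation check of this result (illustrative; assumes the MASS package is available for `mvrnorm()`):

```r
library(MASS)                            # for mvrnorm(); assumed installed

set.seed(3)
mu_sim    <- c(1, 2, 3)
Sigma_sim <- matrix(c(2.0, 0.5, 0.3,
                      0.5, 1.0, 0.2,
                      0.3, 0.2, 1.5), nrow = 3)
A <- matrix(c(1, 0, -1,
              0, 2,  1), nrow = 2, byrow = TRUE)       # a 2 x 3 transformation

Ysim <- mvrnorm(1e5, mu = mu_sim, Sigma = Sigma_sim)   # rows are draws of Y^T
AY   <- Ysim %*% t(A)                                  # rows are draws of (AY)^T

rbind(colMeans(AY), as.numeric(A %*% mu_sim))   # empirical vs. theoretical mean
cov(AY); A %*% Sigma_sim %*% t(A)               # empirical vs. theoretical covariance
```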
Applying this result with \(\textbf{A}=(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\) and \(\textbf{Y}\sim MVN(\textbf{X}\boldsymbol \beta, \sigma^2 \textbf{I})\), the least squares estimator satisfies \[\hat{\boldsymbol \beta} \sim MVN(\boldsymbol \beta, \sigma^2(\textbf{X}^T\textbf{X})^{-1})\]
Then the distribution of an individual \(\hat{\beta}_j\) is given by
\[\hat{\beta}_j \sim N(\beta_j, \sigma^2 C_{j,j})\]
where \(C_{j,j}\) is the \(j^{th}\) diagonal entry of the \((\textbf{X}^T\textbf{X})^{-1}\) matrix.
As before we estimate \(\sigma^2\) with \(MS(Res)\), which under MLR is given by \(\hat{\sigma}^2=s^2=MS(Res)=SS(Res)/(n-p-1)\).
Thus, the standard error of \(\hat{\beta}_j\) is \[se(\hat{\beta}_j)=\hat{\sigma}\sqrt{C_{j,j}}\]
Then the test for an individual slope is
\[H_0: \beta_j=0\]
\[H_A: \beta_j \neq 0\] The test statistic is
\[t_j=\frac{\hat{\beta}_j}{se(\hat{\beta}_j)}=\frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{C_{j,j}}}\]
The confidence interval is
\[\hat{\beta}_j \pm t^*_{df=n-p-1, \alpha/2} \times se(\hat{\beta}_j)\]
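Continuing the illustrative multiple regression fit from the matrix-form sketch above, the standard error, t statistic, and confidence interval for the \(x_1\) coefficient can be reproduced from \((\textbf{X}^T\textbf{X})^{-1}\).

```r
p      <- 2                                       # two predictors in the sketch
s2_mlr <- sum(resid(fit_mlr)^2) / (n - p - 1)     # MS(Res)
C      <- solve(t(X) %*% X)                       # (X'X)^{-1}

se1 <- sqrt(s2_mlr * C[2, 2])                     # se(beta1_hat); column 2 is x1
t1  <- coef(fit_mlr)[2] / se1
ci1 <- coef(fit_mlr)[2] + c(-1, 1) * qt(0.975, df = n - p - 1) * se1

summary(fit_mlr)$coefficients["x1", ]             # matches se1 and t1
confint(fit_mlr, "x1"); ci1                       # matching confidence intervals
```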
So far we have considered individual hypothesis tests and confidence intervals for regression coefficients; however, performing multiple tests each at level \(\alpha\) inflates the overall (family-wise) Type I error rate.
If we want to construct “simultaneous” hypothesis tests and/or confidence intervals for \(\beta_1, \beta_2, ..., \beta_p\) we have to consider adjustments for multiple comparisons.
One common approach is the Bonferroni correction: for \(k\) simultaneous tests or intervals, adjust the Type I error rate of each to \(\alpha/k\), as in the sketch below.
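A minimal sketch of the Bonferroni adjustment for the two slope tests in the illustrative fit (`k` and `alpha` are assumptions of the example, not values from the notes):

```r
k     <- 2                                   # number of simultaneous tests
alpha <- 0.05

pvals <- summary(fit_mlr)$coefficients[c("x1", "x2"), "Pr(>|t|)"]
pvals < alpha / k                            # reject only if p < alpha / k
p.adjust(pvals, method = "bonferroni")       # equivalently, adjust the p-values

confint(fit_mlr, c("x1", "x2"), level = 1 - alpha / k)  # simultaneous intervals
```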
The analysis of variance (ANOVA) decomposes the total variability in \(Y\) into a regression component and a residual component:
\[\sum_{i=1}^n(Y_i-\bar{Y})^2=\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2+\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\]
Where we have the following sums of squares: \(SS(Tot)=\sum_{i=1}^n(Y_i-\bar{Y})^2\) (total), \(SS(Reg)=\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2\) (regression), and \(SS(Res)=\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\) (residual).
The corresponding mean squares are each sum of squares divided by its degrees of freedom, as laid out in the ANOVA table below.
Source | DF | Sum of Squares | Mean Squares | F-value | P-value |
---|---|---|---|---|---|
Regression | \(p\) | \(SS(Reg)\) | \(MS(Reg)=SS(Reg)/p\) | \(MS(Reg)/MS(Res)\) | `pf(f_val, df1=p, df2=n-p-1, lower.tail=FALSE)` |
Residual | \(n-p-1\) | \(SS(Res)\) | \(MS(Res)=SS(Res)/(n-p-1)\) | — | — |
Total | \(n-1\) | \(SS(Tot)\) | — | — | — |
The proportion of variability described by the model:
\[R^2=\frac{SS(Reg)}{SS(Tot)}=1-\frac{SS(Res)}{SS(Tot)}\]
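Both the decomposition and \(R^2\) can be verified from the fitted values of the illustrative fit.

```r
SS_Tot <- sum((Y - mean(Y))^2)
SS_Reg <- sum((fitted(fit_mlr) - mean(Y))^2)
SS_Res <- sum(resid(fit_mlr)^2)

SS_Tot - (SS_Reg + SS_Res)                       # ~ 0: the decomposition holds
c(SS_Reg / SS_Tot, summary(fit_mlr)$r.squared)   # hand-computed R^2 vs. lm()
```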
Hypotheses:
\[H_0: \beta_1=\beta_2=...=\beta_p=0\]
\[H_A: \text{at least one } \beta_j\neq 0\]
Test Statistic:
\[F=\frac{\frac{SS(Reg)}{p}}{\frac{SS(Res)}{n-p-1}}=\frac{MS(Reg)}{MS(Res)}\]
Reference Distribution: \(F_{df1=p, df2=n-p-1}\)
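Using the sums of squares computed above for the illustrative fit, the F statistic and its p-value match the values reported by `summary()`.

```r
p      <- 2
MS_Reg <- SS_Reg / p
MS_Res <- SS_Res / (n - p - 1)

F_val <- MS_Reg / MS_Res
pf(F_val, df1 = p, df2 = n - p - 1, lower.tail = FALSE)   # overall p-value

summary(fit_mlr)$fstatistic          # value, numdf, dendf; value matches F_val
```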
Hypotheses
\[H_0: \textbf{k}^T \boldsymbol \beta = \textbf{m}\]
\[H_A: \textbf{k}^T \boldsymbol \beta \neq \textbf{m}\]
Where \(\textbf{k}\) is a \((p+1)\times r\) matrix of constants with linearly independent columns (so the hypothesis imposes \(r\) constraints) and \(\textbf{m}\) is an \(r\times 1\) vector of constants.
The estimated contrast has sampling distribution
\[(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m}) \sim MVN(\textbf{k}^T \boldsymbol \beta - \textbf{m},\ \sigma^2 \textbf{k}^T(\textbf{X}^T\textbf{X})^{-1}\textbf{k}),\]
so under the null hypothesis its mean vector is \(\textbf{0}\).
Then, under the null hypothesis,
\[(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})^T[\sigma^2 \textbf{k}^T(\textbf{X}^T\textbf{X})^{-1}\textbf{k}]^{-1}(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})\sim \chi^2_{df=r},\] which yields the following quadratic form
\[\text{Q}=(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})^T[\textbf{k}^T(\textbf{X}^T\textbf{X})^{-1}\textbf{k}]^{-1}(\textbf{k}^T \hat{\boldsymbol \beta} - \textbf{m})\]
Then the F-test statistic is given by
\[F=\frac{\textbf{Q}/r}{s^2}\sim F_{df1=r, df2=n-p-1}\]
The general linear hypothesis can be used to compare full and reduced models, which is useful for variable selection.
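For instance, testing \(H_0:\beta_2=0\) in the illustrative fit corresponds to \(\textbf{k}^T=[0,\,0,\,1]\), \(\textbf{m}=0\), \(r=1\); the resulting F statistic agrees with the full vs. reduced comparison from `anova()` (a sketch under the same assumptions and object names as the earlier blocks).

```r
p  <- 2
kT <- matrix(c(0, 0, 1), nrow = 1)              # selects beta_2; r = 1 constraint
m  <- 0
bhat   <- coef(fit_mlr)
MS_Res <- sum(resid(fit_mlr)^2) / (n - p - 1)   # s^2 from the full model

Q <- t(kT %*% bhat - m) %*%
     solve(kT %*% solve(t(X) %*% X) %*% t(kT)) %*%
     (kT %*% bhat - m)
F_glh <- as.numeric(Q) / 1 / MS_Res             # (Q / r) divided by s^2
pf(F_glh, df1 = 1, df2 = n - p - 1, lower.tail = FALSE)

# Equivalent full vs. reduced model comparison
fit_red <- lm(Y ~ x1)
anova(fit_red, fit_mlr)                         # same F statistic and p-value
```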
In this class we discussed: simple linear regression and its least squares estimators, inference for the slope, confidence and prediction intervals, multiple linear regression in matrix form, the multivariate normal distribution, the ANOVA decomposition with the overall F-test, and the general linear hypothesis.