Multiple Regression

Multiple Linear Regression

Author

Henry Athiany

1 Introduction

Multiple linear regression is an extension of simple linear regression model to predict an outcome variable, say \(Y\), on the basis of multiple distinct predictor variables \((X_i)\).

Here we shall learn about:

Matrix formulation of regression models
Parameter estimation
Inference and predictions
Extra sum of squares
Multicollinearity

Suppose we have more than one predictor (explanatory) variable, say \(p-1\) of them i.e. \(X_1, X_2, X_3, \ldots, X_{p-1}\)

Then the linear regression model which includes all these variables is called multiple linear regression model and is given by

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{ip-1} + \varepsilon_i \]

The model can also be written as

\[ Y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j X_{ij} + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2) \]

If we let \(X_0 = 1\), then the model reduces to

\[ Y_i = \sum_{j=0}^{p-1} \beta_j X_{ij} + \varepsilon_i \]

This is a straight forward extension of the simple linear regression model.

For example

Now, suppose we have two predictor variables \(X_1\) and \(X_2\), then we will assume the following model:

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \]

\(Y_i\) is the response for trial \(i\).

\(\beta_0, \beta_1\) and \(\beta_2\) are the parameters in the model.

Then,

\[ E(\varepsilon_i) = 0 \]

\[ E(Y_i) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \]

is a plane

\[ \varepsilon_i = Y_i - E(Y_i) \]

\(\beta_1\) is the change in response per unit increase in \(X_1\) when \(X_2\) stays constant.

The coefficients \(\beta_1\) and \(\beta_2\) are called partial regression coefficients.

2 General Linear Regression Model

Suppose we let \(X_1, X_2, \ldots, X_{p-1}\) be the predictor variables, then the model is

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{ip-1} + \varepsilon_i \]

The parameters are \(\beta_0, \beta_1, \ldots, \beta_{p-1}\), known constants are \(X_{i1}, X_{i2}, \ldots, X_{ip-1}\), \(\varepsilon_i \sim N(0,\sigma^2)\) with \(i = 1, 2, \ldots, n\)

Note:

If we let

\(X_{i0} = 1\)

Thus,

\[ Y_i = \sum_{j=0}^{p-1} \beta_j X_{ij} + \varepsilon_i \]

\[ = \beta_0 X_{i0} + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{ip-1} + \varepsilon_i \]

Now: Suppose \(X_{ij}\) is Gender or Race

How do we work with such variables?

In such situations we shall work with dummy variables: \(0, 1\).

e.g.

\(Y\) : Length of stay in hospital
\(X_1\) : Age
\(X_2\) : Gender

Model: \[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \]

\(X_{i1}\) = patient’s age
\(X_{i2}\) : Female (1), Male (0)

Male: \(E(Y)=\beta_0 + \beta_1 X_1\)

Female: \(E(Y)=\beta_0 + \beta_1 X_1 + \beta_2X_2\)

and many more models.

In addition, we may have interactions or combinations of cases.

Interaction:

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2} + \varepsilon_i \]

Combination:

Still linear?

# Yes, linear in regression.

3 Matrix Formulation

Let’s consider the following multiple linear regression model:

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{ip-1} + \varepsilon_i, \qquad i = 1, 2, \ldots, n \]

When \(i = 1\):

\[ Y_1 = \beta_0 + \beta_1 X_{11} + \beta_2 X_{12} + \cdots + \beta_{p-1} X_{1p-1} + \varepsilon_1 \]

When \(i = 2\):

\[ Y_2 = \beta_0 + \beta_1 X_{21} + \beta_2 X_{22} + \cdots + \beta_{p-1} X_{2p-1} + \varepsilon_2 \]

When \(i = 3\):

\[ Y_3 = \beta_0 + \beta_1 X_{31} + \beta_2 X_{32} + \cdots + \beta_{p-1} X_{3p-1} + \varepsilon_3 \] \[ \vdots \] \[ \vdots \]

When \(i = n\):

\[ Y_n = \beta_0 + \beta_1 X_{n1} + \beta_2 X_{n2} + \dots + \beta_{p-1} X_{np-1} + \varepsilon_n \]

In Matrix form, this can be written as:

\[ \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p-1} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p-1} \\ 1 & X_{31} & X_{32} & \cdots & X_{3p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np-1} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{bmatrix} \]

That is,

\[ Y = X\beta + \varepsilon \]

Matrix dimensions:

\[ (n \times 1) = (n \times p)(p \times 1) + (n \times 1) \] The general Linear Regression Model (Matrix Form)

\[ Y = X\beta + \varepsilon \]

Where:

\(Y\) is the vector of responses (dependent variable)
\(\beta\) is the vector of regression parameters
\(X\) is the design matrix made up of explanatory variables and one column of constants (general matrix of constants)
\(\varepsilon\) is the vector of independent normal random variables with \[ E(\varepsilon) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \] and \[ \operatorname{Var}(\varepsilon) = E(\varepsilon \varepsilon^T) \]

Proof

\[ E(Y) = E(X\beta + \varepsilon) = X\beta + E(\varepsilon) \]

Since:

\[ E(\varepsilon_1) = 0, \; E(\varepsilon_2) = 0, \; \ldots, \; E(\varepsilon_n) = 0 \]

Then:

\[ E(Y) = X\beta \]

Error Covariance

\[ \operatorname{Var}(\varepsilon) = E(\varepsilon \varepsilon^T)= E(\varepsilon \varepsilon') \]

Since \(\varepsilon\) is a nx1 matrix, that is:

\[ \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}, \qquad \varepsilon^T = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n) \]

Then:

\[ \varepsilon \varepsilon^T = \begin{bmatrix} \varepsilon_1^2 & \varepsilon_1 \varepsilon_2 & \cdots & \varepsilon_1 \varepsilon_n \\ \varepsilon_2 \varepsilon_1 & \varepsilon_2^2 & \cdots & \varepsilon_2 \varepsilon_n \\ \vdots & \vdots & \ddots & \vdots \\ \varepsilon_n \varepsilon_1 & \varepsilon_n \varepsilon_2 & \cdots & \varepsilon_n^2 \end{bmatrix} \]

This is what we refer to as the error covariance matrix. Since the error terms are said to be independent and have a common/constant variance \(\sigma^2\),

\[ E(\varepsilon_i, \varepsilon_j) = 0 \dots for \;\; (i \neq j), \qquad E(\varepsilon_i^2) = \sigma^2 \]

Thus,

\[ \sigma^2(\varepsilon) = \begin{bmatrix} \sigma_1^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix} \]

\[ \operatorname{Var}(\varepsilon) = \sigma^2 \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} = \sigma^2 I_n \]

\[ Y \mid X \sim N(X\beta, \sigma^2 I_n) \]

That is, in matrix terms the normal regression model can be written as:

\[ Y = X\beta + \varepsilon, \]

Where

\[ \quad E(\varepsilon) = 0, \quad \operatorname{Var}(\varepsilon) = \sigma^2 I_n, \quad \varepsilon \sim N(0_n, \sigma^2 I_n) \] \(\varepsilon \sim MVN(0_n, \sigma^2 I_n)\)

and the variance-covariance matrix of \(Y\) is the same as that of the error term.

Variance of \(Y\)

\[ E(Y) = E(X\beta) + E(\varepsilon) = X\beta \] Since \(E(\varepsilon)=0\) \[ \operatorname{Var}(Y) = \operatorname{Var}(X\beta + \varepsilon) = \sigma^2 I_n \] Since \(X\beta\) is constant and \(\operatorname{var}(constant)=0\)

\[ \operatorname{Var}(Y) = 0 + \operatorname{Var}(\varepsilon) \]

\[ \operatorname{Var}(Y) = \sigma^2 I_n \]

Finally,

\[ Y \sim N(X\beta, \sigma^2 I_n) \]

4 Estimation of Regression Coefficients

If we remember the normal equations that we derived,

\[ n\beta_0 + \beta_1 \sum X_i = \sum Y_i \]

\[ \beta_0 \sum X_i + \beta_1 \sum X_i^2 = \sum X_i Y_i \]

and the fact that

\[ X'X = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix} \]

Also,

\[ X'Y = \begin{bmatrix} 1 & \cdots & 1 \\ X_1 & \cdots & X_n \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix} \]

Then we can see that these equations are equivalent to the following matrix operation

\[ X'X \, \beta = X'Y \]

With

\[ \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \]

with the solution to this equation given by

\[ \beta = (X'X)^{-1} X'Y \]

when \((X'X)^{-1}\) exists.

When Inverse Does Not Exist

If the rank of an \(n \times n\) matrix is less than \(n\), then it does not have an inverse.
If determinant is equal to zero, then inverse does not exist.

But when does \((X'X)^{-1}\) exist?

\(X\) is an \(n \times p\) (or \(p +1\) depending on how you define \(p\)) design matrix.
\(X\) must have full column rank in order for the inverse to exist.
i.e.

\[ \operatorname{rank}(X) = p \Rightarrow (X'X)^{-1} \text{ exists} \]

4.1 Ordinary Least Square Approach (OLS)

The strategy in the least square approach is the same as in the bivariate linear regression model.

First we calculate the sum of squares due to residuals.
Secondly we find a set of estimators that minimize the sum of squares due to residuals.

Thus, minimizing this problem of the sum of squared residuals in matrix form is given by

\[ Y = X\beta + \varepsilon \]

\[ \varepsilon = Y - X\beta \]

\[ \varepsilon^2 = (Y - X\beta)^2 \]

but we know that \(\varepsilon^2=\varepsilon'\varepsilon\)

Hence,

\[ \varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta) \]

\[ = Y'Y - Y'X\beta - \beta'X'Y + \beta'X'X\beta \]

\[ = Y'Y - 2\beta'X'Y + \beta'X'X\beta \]

Let

\[ Q = Y'Y - 2\beta'X'Y + \beta'X'X\beta \]

We wish to find the vector of least square estimations that minimizes

\[ L = \sum \varepsilon^2 = \varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta) \]

Then taking partial derivatives with respect to \(\beta\), we obtain

\[ \frac{\partial Q}{\partial \beta} = -2X'Y + 2X'X\beta \]

To minimize,

\[ \frac{\partial Q}{\partial \beta} = 0 \]

\[ \Rightarrow -2X'Y + 2X'X\beta = 0 \]

\[ \Rightarrow X'X\beta = X'Y \]

\[ \Rightarrow (X'X)^{-1} X'X \beta = (X'X)^{-1} X'Y \]

\[ \Rightarrow \hat{\beta} = (X'X)^{-1} X'Y \]

This is the resulting least square estimator for the multiple linear regression model in the matrix form.

This is sometimes called Ordinary Least Squares (OLS) estimator.

The fitted values are given by

\[ \hat{Y} = X\hat{\beta} \]

4.1.1 Unbiasedness of the OLS Estimator \(\hat{\beta}\)

Here we show the unbiasedness of the OLS estimator under the following assumptions:

Linearity in parameters
Zero onditional mean \[E(\varepsilon \mid X) = 0\]
No perfect collinearity, i.e. \(X\) has full column rank
The OLS estimator is given by

\[ \hat{\beta} = (X'X)^{-1}X'Y \]

We therefore need to show that:

\[ E(\hat{\beta}) = \beta \quad \text{or} \quad E(\hat{\beta} - \beta) = 0 \]

From assumption (i),

\[ Y = X\beta + \varepsilon \]

Substitute into the estimator:

\[ \hat{\beta} = (X'X)^{-1}X'(X\beta + \varepsilon) \]

\[ = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon \]

\[ \hat{\beta} = \beta + (X'X)^{-1}X'\varepsilon \]

Taking expectations:

\[ E(\hat{\beta}) = E\left[\beta + (X'X)^{-1}X'\varepsilon \right] \]

\[ = E(\beta) + (X'X)^{-1}X' E(\varepsilon) \]

But \(E(\varepsilon) = 0\),

\[ E(\hat{\beta}) = \beta \]

Hence the OLS estimator is unbiased.

Bias

\[ E(\hat{\beta} - \beta)= E\left[\beta + (X'X)^{-1}X'\varepsilon - \beta \right ] = E\left[(X'X)^{-1}X'\varepsilon \right] = 0 \]

Thus,

\[ E(\hat{\beta} - \beta) = 0 \]

Under assumptions (i) and (ii), the OLS estimator is unbiased. We can also show that variance of the OLS estimate is the Best Linear Unbiased Estimator (BLUE)

Since

\[ \operatorname{Var}(\varepsilon \mid X) = \sigma^2 \]

and

\[ \operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for } i \neq j, \]

then,

\[ \operatorname{Var}(\varepsilon \mid X) = \sigma^2 I_n \]

Because of these assumptions, we have

\[ \hat{\beta} = (X'X)^{-1}X'Y \]

and

\[ Y=X\beta + \varepsilon \] Then

\[ \hat{\beta} = (X'X)^{-1}X'(X\beta + \varepsilon) \]

\[ \hat{\beta} = (X'X)^{-1}X'X\beta + (X'X)^{-1} X'\varepsilon \] \[ \hat{\beta} = \beta + (X'X)^{-1} X'\varepsilon \]

Then

\[ \operatorname{Var}(\hat{\beta}) = \operatorname{Var}\left[\beta + (X'X)^{-1}X'\varepsilon \right] \]

\[ = \operatorname{Var}\left(\beta ) + \operatorname{Var}(X'X^{-1}X'\varepsilon \right) \]

\[ = (X'X)^{-1}X' \operatorname{Var}(\varepsilon) X (X'X)^{-1} \]

\[ = (X'X)^{-1}X' (\sigma^2 I_n) X (X'X)^{-1} \]

\[ = \sigma^2 (X'X)^{-1}X'X (X'X)^{-1} \]

\[ \operatorname{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1} \]

Under assumptions (i) to (iv), \(\hat{\beta}\) is the Best Linear Unbiased Estimator (BLUE).

Since \(\sigma^2\) is unknown, it can be estimated by the Mean Square Error (MSE):

\[ \text{MSE} = \frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n - p} = \frac{(Y - X\hat{\beta})'(Y - X\hat{\beta})}{n - p} \]

where \(p\) is the number of parameters in the model

\[ \operatorname{Var}(\hat{\beta}) = \text{MSE} \, (X'X)^{-1} \]

4.1.2 Interval Estimation of \(\hat{\beta_k}\)

This is an extension of the interval estimation in the simple linear regression.
In SLR, we estimated 2 unknown parameters \(\beta_0\) and \(\beta_1\), in multiple linear regression (MLR), we estimate \(p + 1\) parameters (\(\beta_0\),\(\beta_1\),\(\beta_3\),\(\beta_4\) \(\dots\),\(\beta_{p-1}\)).
Thus, we lose \(p+1\) degrees of freedom, which leaves:

\[ n - p \]

The test statistic is:

\[ t = \frac{\hat{\beta}_k - \beta_k}{SE(\hat{\beta}_k)} \sim t_{n-p}, \quad k = 0, 1, 2, \ldots, p \]

The \((1-\alpha)\)% confidence interval is given by:

\[ \hat{\beta}_k \pm t_{1-\alpha/2,(n-p)} \, SE(\hat{\beta}_k) \]