Zahid Asghar, School of Economics, Quaid-i-Azam University
11/18/2020
One of the assumptions of the classical linear regression model (CLRM) is that there is no exact linear relationship among the regressors in \(Y_{i}=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\dots+\beta_k X_{ki}+e_{i}\).
If there are one or more such relationships among the regressors, we call it multicollinearity, or collinearity for short.
Perfect collinearity: An exact linear relationship exists among two or more of the regressors.
Imperfect collinearity: The regressors are highly (but not perfectly) collinear.
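The difference between the two cases can be seen numerically. The sketch below is only an illustration: the data are made up, and the helper name `cross_product_determinant` is hypothetical. It computes the determinant of the centered sums-of-squares and cross-products matrix for two regressors. Under perfect collinearity the determinant is zero, so the normal equations have no unique solution; under imperfect collinearity it is positive but small relative to the product of the two sums of squares.

```python
# Hypothetical data for illustration; none of these numbers come from the lecture.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_perfect = [2.0 * v for v in x1]      # exact linear function of x1
x2_near = [2.1, 3.9, 6.2, 7.8, 10.1]    # close to 2 * x1, but not exact


def cross_product_determinant(a, b):
    """Determinant of the 2x2 matrix of centered sums of squares and cross products."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return s_aa * s_bb - s_ab ** 2


det_perfect = cross_product_determinant(x1, x2_perfect)
det_near = cross_product_determinant(x1, x2_near)

print("perfect collinearity:  ", det_perfect)  # zero: no unique OLS solution
print("imperfect collinearity:", det_near)     # positive: OLS estimates exist
```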
Purpose
Prediction/forecasting: multicollinearity (MC) is not an issue
Measuring the effect of one variable on Y, holding the others constant: MC is important
Data Issue
If collinearity is not perfect, but high, several consequences ensue:
The OLS estimators are still BLUE, but one or more regression coefficients have large standard errors relative to the values of the coefficients, thereby making the t ratios small.
Even though some regression coefficients are statistically insignificant, the \(R^2\) value may be very high.
Therefore, one may conclude (misleadingly) that the true values of these coefficients are not different from zero.
Also, the regression coefficients may be very sensitive to small changes in the data, especially if the sample is relatively small.
For the following regression model: \(Y_{i}=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+e_{i}\)
It can be shown that:
\(\mathrm{Var}(\hat{\beta}_1) = \sigma^2 \left( \frac{1}{1 - r_{12}^2} \right) \frac{1}{S_{x_1 x_1}}\)
\(\mathrm{Var}(\hat{\beta}_2) = \sigma^2 \left( \frac{1}{1 - r_{21}^2} \right) \frac{1}{S_{x_2 x_2}}\)
where \(S_{x_j x_j} = \sum_{i}(X_{ji}-\bar{X}_j)^2\) for \(j = 1, 2\), and \(r_{12} = r_{21}\) is the sample correlation between \(X_1\) and \(X_2\).
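The Variance Inflation Factor (VIF) is defined as \(\frac{1}{1 - r_{12}^2}\) (equivalently \(\frac{1}{1 - r_{21}^2}\), since \(r_{12} = r_{21}\)). It measures how much the variance of a coefficient is inflated relative to the case of uncorrelated regressors, where \(r_{12} = 0\) and the VIF equals 1.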
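As a rough numerical check of the variance formula, the sketch below computes the VIF for two made-up regressors and compares the resulting variance of \(\hat{\beta}_1\) with the value obtained from the inverse of the centered cross-product matrix. The data, the assumed error variance `sigma2 = 1.0`, and all variable names are illustrative assumptions, not part of the lecture.

```python
import math

# Hypothetical data and error variance, chosen only for illustration.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
sigma2 = 1.0


def centered_sums(a, b):
    """Return (S_aa, S_bb, S_ab): centered sums of squares and cross products."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return s_aa, s_bb, s_ab


s11, s22, s12 = centered_sums(x1, x2)
r12 = s12 / math.sqrt(s11 * s22)   # sample correlation between x1 and x2
vif = 1.0 / (1.0 - r12 ** 2)       # variance inflation factor

# Variance of beta_1-hat from the formula: sigma^2 * VIF / S_11
var_b1_formula = sigma2 * vif / s11

# Same variance from sigma^2 times the (1,1) element of the inverse
# of the centered cross-product matrix [[s11, s12], [s12, s22]].
var_b1_matrix = sigma2 * s22 / (s11 * s22 - s12 ** 2)

# Benchmark: variance if x1 and x2 were uncorrelated (VIF = 1).
var_b1_uncorrelated = sigma2 / s11

print("r12 =", r12)
print("VIF =", vif)
print("Var(b1), formula:", var_b1_formula)
print("Var(b1), matrix: ", var_b1_matrix)
print("Var(b1), r12 = 0:", var_b1_uncorrelated)
```

The two variance calculations agree up to rounding, and their ratio to the uncorrelated benchmark is exactly the VIF.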
High \(R^2\) but few significant t ratios.
High pair-wise correlations among explanatory variables or regressors.
High partial correlation coefficients.
Significant F test for auxiliary regressions (regressions of each regressor on the remaining regressors).
High Variance Inflation Factor (VIF) – particularly exceeding 20 (some suggest 10) in value – and low Tolerance Factor (TOL, the inverse of VIF).
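The auxiliary-regression and VIF/TOL diagnostics listed above can be sketched in a few lines. With more than two regressors, the VIF of regressor j is commonly computed as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing regressor j on the remaining regressors. The code below is a minimal sketch under that definition. It uses made-up data, and the helper names `solve_linear_system`, `r_squared` and `vif_and_tolerance` are hypothetical, written here with the standard library only.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 from an OLS regression of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif_and_tolerance(columns):
    """For each regressor, run the auxiliary regression and return (VIF, TOL)."""
    results = []
    for j, target in enumerate(columns):
        others = [c for i, c in enumerate(columns) if i != j]
        vif = 1.0 / (1.0 - r_squared(target, others))
        results.append((vif, 1.0 / vif))
    return results


# Hypothetical regressors: x2 is nearly equal to x1, x3 is not built from them.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]
x3 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

diagnostics = vif_and_tolerance([x1, x2, x3])
for name, (vif, tol) in zip(["x1", "x2", "x3"], diagnostics):
    print(name, "VIF =", round(vif, 2), "TOL =", round(tol, 4))
```

Solving the normal equations directly is acceptable for a small illustration, although it becomes numerically fragile as collinearity approaches perfection, so dedicated regression routines are preferable in practice.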
We fit the same regression to two versions of the Longley data: one with all 16 years (full) and one with the year 1962 omitted (md62). Adding just the 1962 observation changes the magnitudes of several coefficients substantially. We have also calculated the VIFs for both models.
| | md62 | full |
|---|---|---|
| (Intercept) | 1459415.07 | 1169087.53 |
| | (714182.87) | (835902.44) |
| year | -721.76 | -576.46 |
| | (369.98) | (433.49) |
| price | -181.12 | -19.77 |
| | (135.52) | (138.89) |
| gnp | 0.09 ** | 0.06 ** |
| | (0.02) | (0.02) |
| armed | -0.07 | -0.01 |
| | (0.26) | (0.31) |
| N | 15 | 16 |
| R2 | 0.98 | 0.97 |
| logLik | -113.20 | -123.76 |
| AIC | 238.41 | 259.52 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | | |
## year price gnp armed
## 121.533754 87.346665 154.075397 1.559474
## year price gnp armed
## 143.463545 75.670734 132.463801 1.553191
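To relate these VIFs to the detection rules, the sketch below converts the first of the two VIF outputs above into tolerance values (TOL = 1/VIF) and the implied auxiliary \(R_j^2 = 1 - 1/\text{VIF}_j\). The VIF numbers are copied from that output. The dictionary and variable names are illustrative, and the threshold of 10 is the more conservative of the two cut-offs mentioned earlier.

```python
# VIFs copied from the first VIF output above.
reported_vif = {
    "year": 121.533754,
    "price": 87.346665,
    "gnp": 154.075397,
    "armed": 1.559474,
}

# Tolerance is the inverse of the VIF.
tolerance = {name: 1.0 / v for name, v in reported_vif.items()}

# Since VIF_j = 1 / (1 - R_j^2), the auxiliary R_j^2 is 1 - 1 / VIF_j.
auxiliary_r2 = {name: 1.0 - 1.0 / v for name, v in reported_vif.items()}

# Regressors flagged by the VIF > 10 rule of thumb.
flagged = [name for name, v in reported_vif.items() if v > 10]

for name in reported_vif:
    print(name, "TOL =", round(tolerance[name], 4),
          "auxiliary R2 =", round(auxiliary_r2[name], 4))
print("VIF above 10:", flagged)
```

On these numbers, year, price and gnp are each almost fully explained by the other regressors, while armed is not.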
Obtain more data, although this is easier said than done
Add more information, which does not necessarily mean more observations
Bias-variance trade-off
The main remedy is to impose theory on the data
Principal components, but observe caution: combining variables often has no interpretation. For example, how do we interpret the price elasticity minus twice the income elasticity?
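The principal-components remedy and its interpretation problem can be illustrated for two regressors. For two standardized variables with positive correlation r, the principal components are \((z_1+z_2)/\sqrt{2}\) and \((z_1-z_2)/\sqrt{2}\), with variance shares \((1+r)/2\) and \((1-r)/2\). The sketch below uses made-up data and illustrative helper names. It shows that the components are uncorrelated with each other, but each is a weighted combination of the original variables, which is why a coefficient on a component is hard to interpret.

```python
import math

# Hypothetical, highly correlated regressors (not from the lecture).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def correlation(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return s_ab / math.sqrt(s_aa * s_bb)


z1, z2 = standardize(x1), standardize(x2)

# Principal components of two standardized, positively correlated variables.
pc1 = [(u + v) / math.sqrt(2) for u, v in zip(z1, z2)]  # a scaled "sum"
pc2 = [(u - v) / math.sqrt(2) for u, v in zip(z1, z2)]  # a scaled "difference"

r12 = correlation(x1, x2)
share_pc1 = (1.0 + r12) / 2.0  # share of total variance carried by pc1

print("corr(x1, x2)   =", r12)
print("corr(pc1, pc2) =", correlation(pc1, pc2))  # zero up to rounding error
print("variance share of pc1 =", share_pc1)
```

Regressing Y on `pc1` alone removes the collinearity, but the estimated coefficient then refers to a mix of both standardized regressors, not to the separate effect of either one.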