Lesson 2.1- Multiple Linear Regression Model (OLS Estimation)
Author
Norberto E. Milla, Jr.
Published
March 16, 2023
1 Regression model with two explanatory variables
In simple regression analysis with only one explanatory variable, the assumption SLR.4 (all other factors affecting y are uncorrelated with x) is generally very restrictive and unrealistic.
Multiple regression analysis allows us to explicitly control for many other factors that simultaneously affect y.
By adding new explanatory variables we can explain more of the variation in y. In other words, we can develop more successful models.
Additional advantage: multiple regression can incorporate general functional form relationships.
Example 1: Wage equation
wage = \beta_0 + \beta_1educ + \beta_2exper + u
wage: hourly wages (in US dollars); educ: level of education (in years); exper: level of experience (in years)
This wage equation allows us to measure the impact of education on wage holding experience fixed, and vice versa
\beta_1: measures the impact of education on wage, holding all other factors fixed.
\beta_2: measures the ceteris paribus effect of experience on wage.
Example 2: Student success and family income equation
avgscore = \beta_0 + \beta_1expend + \beta_2avginc + u
avgscore: average standardized test score; expend: education spending per student; avginc: average family income
If avginc is not included in the model directly, its effect will be included in the error term, u.
Because average family income is generally correlated with education expenditures, the key assumption SLR.4 will be invalid: x (expend) will be correlated with u, leading to biased OLS estimators.
Multiple regression analysis also allows us to use more general functional forms.
Consider the following quadratic model of consumption
cons = \beta_0 + \beta_1inc + \beta_2inc^2 + u
Here x_1 = inc and x_2 = inc^2
\beta_1 no longer measures the ceteris paribus effect of inc on cons, because we cannot fix inc^2 while changing inc.
The marginal propensity to consume (MPC) is therefore approximated by \frac{\Delta cons}{\Delta inc} \approx \beta_1 + 2\beta_2 inc
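As an illustration (a minimal sketch using simulated data; the variable names cons and inc and all parameter values are made up for this example), the quadratic term can be included in R with I(inc^2) and the MPC evaluated at a chosen income level:
Code
# Simulated consumption data (illustrative values only)
set.seed(123)
n    <- 200
inc  <- runif(n, 1, 100)
cons <- 5 + 0.8 * inc - 0.002 * inc^2 + rnorm(n, sd = 5)

quad_fit <- lm(cons ~ inc + I(inc^2))          # quadratic consumption function
b <- coef(quad_fit)

# MPC at inc = 50 is approximately beta1 + 2 * beta2 * 50
b["inc"] + 2 * b["I(inc^2)"] * 50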
The regression model with 2 explanatory variables is formally given by:
y = \beta_0 + \beta_1x_1 + \beta_2x_2 + u
"Zero conditional mean" assumption: E(u|x_1,x_2) = 0
In other words, for all combinations of x_1 and x_2, the expected value of unobservables, u, is zero.
In the wage equation:
E(u|educ,exper) = 0
which means that unobservable factors affecting wage are, on average, unrelated to educ and exper
If ability is a part of u, then average ability levels must be the same across all combinations of education and experience in the population.
In the test scores and average family income equation, the zero conditional mean assumption implies that
E(u|expend,avginc) = 0
All other factors affecting the average test scores (such as quality of schools, student characteristics, etc.) are, on average, unrelated to education expenditures and family income
For the quadratic consumption model:
E(u|inc, inc^2) = E(u|inc) = 0
Since inc^2 is automatically known when inc is known, we do not need to include it separately in the conditional expectation.
Example: College GPA (colGPA) regressed on high school GPA (hsGPA) and ACT score (ACT):
\hat{\beta}_1 = 0.453: Holding ACT fixed, a one-point increase in high school GPA is predicted to raise college GPA by 0.453 points (almost half a point)
If we choose two students, A and B, with the same ACT score but the hsGPA of student A one point higher than that of student B, then we predict that student A's college GPA is 0.453 points higher than student B's.
\hat{\beta}_2 = 0.009: Holding high school GPA fixed, a one-point increase in ACT is predicted to increase college GPA by only 0.009 points (a very small effect)
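These estimates can be reproduced in R. A sketch, assuming the example is based on the gpa1 data from the wooldridge package (an assumption, since the data source is not stated here):
Code
library(wooldridge)                    # provides the gpa1 data set
data("gpa1")

# College GPA regressed on high school GPA and ACT score
gpa_fit <- lm(colGPA ~ hsGPA + ACT, data = gpa1)
summary(gpa_fit)                       # slope on hsGPA is about 0.453, on ACT about 0.009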
Logarithmic wage equation: log(wage) = \beta_0 + \beta_1educ + \beta_2exper + \beta_3tenure + u, where wage: average hourly earnings; educ: years of education; exper: years of potential experience; tenure: years with current employer
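A sketch of how this equation can be estimated in R, assuming the wage1 data from the wooldridge package (the data set is not named in the text):
Code
library(wooldridge)                    # provides the wage1 data set
data("wage1")

# log(wage) regressed on education, experience and tenure
lwage_fit <- lm(log(wage) ~ educ + exper + tenure, data = wage1)
summary(lwage_fit)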
MLR.2: Random Sampling
We have a random sample of n observations drawn from the population model defined in MLR.1.
MLR.3: No Perfect Collinearity
There is no perfect linear relationship among the independent x variables.
No x variable can be written as an exact linear combination of the other independent variables. If this assumption fails, we have perfect collinearity (perfect multicollinearity).
If the x variables are perfectly collinear, then the OLS estimators cannot be computed.
This assumption allows the independent variables to be correlated; they just cannot be perfectly correlated.
If we did not allow for any correlation among the independent variables, then multiple regression would be of very limited use for econometric analysis
For example, in the regression of student GPA on education expenditures and family income, we suspect that expenditure and income may be correlated and so we would like to hold income fixed to find the impact of expenditure on GPA.
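A small simulated illustration (variable names and values are hypothetical): when one regressor is an exact linear combination of the others, lm() in R cannot estimate its coefficient and reports NA for it:
Code
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- 2 * x1 + x2                      # exact linear combination of x1 and x2
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(100)

coef(lm(y ~ x1 + x2 + x3))             # coefficient on x3 is NA: perfect collinearity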
MLR.4: Zero conditional mean
E(u|x_1, x_2, \cdots, x_k) = 0
This assumption states that the explanatory variables are exogenous: the random error term is uncorrelated with the explanatory variables.
There are several ways this assumption can fail:
Functional form misspecification: if the functional relationship between the explained and explanatory variables is misspecified, this assumption can fail.
Omitting an important variable that is correlated with any of the explanatory variables
Measurement error in explanatory variables
Theorem: Under assumptions MLR.1 through MLR.4, the OLS estimators are unbiased: E(\hat{\beta}_j) = \beta_j, \; j = 0, 1, \cdots, k
Suppose the true model contains both x_1 and x_2, but x_2 is omitted; let \tilde{\beta}_1 denote the OLS slope from the simple regression of y on x_1 only.
To determine the magnitude and sign of the bias we substitute y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i into the formula for \tilde{\beta}_1, re-arrange, and take expectations.
Therefore, the bias of \tilde{\beta}_1 is given by
Bias(\tilde{\beta}_1) = E(\tilde{\beta}_1) - \beta_1 = \beta_2 \frac{\sum_{i=1}^n (x_{i1} - \overline{x}_1)x_{i2}}{\sum_{i=1}^n (x_{i1} - \overline{x}_1)^2} = \beta_2\tilde{\delta}_1
where \tilde{\delta}_1 is the OLS slope from regressing x_2 on x_1.
The sign of bias depends on both \beta_2 and the correlation between omitted variable (x_2) and included variable (x_1)
It is not possible to calculate this correlation if the omitted variable cannot be observed
The following table summarizes possible cases:
Code
knitr::include_graphics("pic12.png")
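From the bias formula above (and assuming the figure shows the usual summary), the four possible cases are:
\beta_2 > 0 and Corr(x_1, x_2) > 0: positive bias
\beta_2 > 0 and Corr(x_1, x_2) < 0: negative bias
\beta_2 < 0 and Corr(x_1, x_2) > 0: negative bias
\beta_2 < 0 and Corr(x_1, x_2) < 0: positive bias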
A small bias relative to \beta_1 may not be a problem in practice
In most cases we are not able to calculate the size of the bias. But in some cases we may have an idea about the direction of bias.
For example, suppose that in the wage equation the true population regression function (PRF) contains both education and ability.
Suppose also that ability is omitted because it cannot be observed, leading to omitted variable bias.
In this case we can say that the sign of the bias is positive (+), because it is reasonable to think that people with more ability tend to have higher levels of education, and ability is positively related to wage.
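A small simulation in R (purely illustrative; all parameter values are made up) confirms this direction of bias: when the omitted x_2 has \beta_2 > 0 and is positively correlated with x_1, the simple-regression slope overestimates \beta_1 on average:
Code
set.seed(42)
n  <- 5000
x1 <- rnorm(n)                              # e.g. education
x2 <- 0.6 * x1 + rnorm(n)                   # e.g. ability, positively correlated with x1
y  <- 1 + 0.5 * x1 + 0.8 * x2 + rnorm(n)    # true model: beta1 = 0.5, beta2 = 0.8

coef(lm(y ~ x1 + x2))["x1"]                 # close to the true value 0.5
coef(lm(y ~ x1))["x1"]                      # noticeably larger than 0.5: positive omitted variable bias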
8 Variances of OLS estimators
MLR.5: Homoscedasticity
This assumption states that, conditional on the x variables, the error term has constant variance:
var(u|x_1, x_2, \cdots, x_k) = \sigma^2
If this assumption fails the model exhibits heteroscedasticity.
This assumption is essential in deriving variances and standard errors of OLS estimators and in showing whether OLS estimators are efficient
We do not need this assumption for unbiasedness
In the wage equation this assumption implies that the variance of unobserved factors does not change with the factors included in the model (education, experience, tenure, etc.)
The assumptions MLR.1 to MLR.5 as stated are valid only for cross-sectional data; they need to be modified for time series data.
These assumptions are referred to as the Gauss-Markov assumptions.
Assumptions MLR.4 and MLR.5 can be restated in terms of the dependent variable: \begin{align}
E(y|x_1, x_2, \cdots, x_k) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \notag \\
var(y|x_1, x_2, \cdots, x_k) &= var(u|x_1, x_2, \cdots, x_k) = \sigma^2 \notag
\end{align}
Theorem:
Under the Gauss-Markov assumptions,
Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_j(1-R_j^2)}, \; j = 1, 2, \cdots, k
where
SST_j = \sum_{i=1}^n (x_{ij} - \overline{x}_j)^2
is the total sample variation in x_{ij} and R_j^2 is the R-squared from regressing x_j on all other independent variables (including an intercept term)
Var(\hat{\beta}_j) moves in the same direction as \sigma^2 but in the opposite direction from SST_j.
To increase SST_j we need to collect more data (increase n).
To reduce \sigma^2 we need to find good explanatory variables.
Var(\hat{\beta}_j) also depends on R_j^2, which measures how strongly x_j is linearly related to the other x variables.
We did not have this term in the simple regression analysis because there was only one explanatory variable.
As the degree of correlation among the x variables increases, the variances of the OLS estimators get larger and larger.
When there is a high level of collinearity among the x variables, the variances of the OLS estimators will be large. This is called the multicollinearity problem.
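A small simulation (illustrative only; the correlations and coefficients are arbitrary) of how the standard error of \hat{\beta}_1 grows as the correlation between x_1 and x_2 increases:
Code
library(MASS)                               # for mvrnorm()

se_beta1 <- function(rho, n = 200) {
  Sigma <- matrix(c(1, rho, rho, 1), 2, 2)  # correlation between the two regressors
  X <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
  y <- 1 + 0.5 * X[, 1] + 0.5 * X[, 2] + rnorm(n)
  summary(lm(y ~ X[, 1] + X[, 2]))$coefficients[2, "Std. Error"]
}

set.seed(7)
sapply(c(0, 0.5, 0.9, 0.99), se_beta1)      # standard errors tend to rise with the correlation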
Estimating Variances
An unbiased estimator of the error variance is
\hat{\sigma}^2 = \frac{1}{n-k-1}\sum_{i=1}^n \hat{u}_i^2 = \frac{SSE}{n-k-1}
Consequently, the standard error of \hat{\beta}_j is given by
se(\hat{\beta}_j) = \frac{\hat{\sigma}}{\sqrt{SST_j(1-R_j^2)}}, \; j = 1, 2, \cdots, k
\hat{\sigma} = \sqrt{\hat{\sigma}^2} is called the root mean square error (RMSE); it is also referred to as the standard error of the regression (SER)
It is the estimator of the standard deviation of the error term
It may either increase or decrease when a new variable is added to the model (adding a variable lowers the sum of squared residuals, but it also lowers the degrees of freedom n-k-1)
It is used in constructing confidence intervals and testing hypotheses
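As a quick check of these formulas (a sketch on simulated data; the variable names and values are arbitrary), both \hat{\sigma} and se(\hat{\beta}_1) computed by hand match the values reported by lm():
Code
set.seed(10)
n  <- 100; k <- 2
x1 <- rnorm(n); x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# sigma-hat: square root of the sum of squared residuals over (n - k - 1)
sigma_hat <- sqrt(sum(resid(fit)^2) / (n - k - 1))
c(sigma_hat, summary(fit)$sigma)                       # identical

# se(beta1-hat) = sigma-hat / sqrt(SST_1 * (1 - R_1^2))
SST1 <- sum((x1 - mean(x1))^2)
R1sq <- summary(lm(x1 ~ x2))$r.squared                 # R^2 from regressing x1 on the other x
c(sigma_hat / sqrt(SST1 * (1 - R1sq)),
  summary(fit)$coefficients["x1", "Std. Error"])       # identical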
Under the assumptions MLR.1 through MLR.5, OLS estimators are the best linear unbiased estimators (BLUE) for unknown population parameters
best: minimum variance (efficient)
linear: the beta coefficients are linear functions of the y_i's, i.e., \hat{\beta}_j = \sum_{i=1}^n w_{ij}y_i
unbiased: E(\hat{\beta}_j) = \beta_j
If any of the five assumptions fails, the Gauss-Markov theorem does not hold
When MLR.4 fails, the unbiasedness property does not hold