EC3133
Omitted Variable Bias
Generate sample data on education, income and ability:
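The data-generating code is not reproduced here; below is a minimal sketch of how such data could be simulated, assuming unobserved ability raises both education and income. The parameter values and seed are illustrative assumptions, not necessarily those behind the results table further down.

# Illustrative simulation (assumed parameters): ability drives both education and income
set.seed(123)
n <- 500
ability   <- rnorm(n, 0, 1)                        # Unobserved ability
education <- 12 + 1.5 * ability + rnorm(n, 0, 2)   # Education increases with ability
income    <- 20000 + 5000 * education + 8000 * ability + rnorm(n, 0, 10000)  # Outcome
ovb_data  <- data.frame(education, ability, income)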
Using the generated data, let’s first estimate the effect of education on income without controlling for ability:
Now do the same, but this time include both education and ability as regressors:
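A sketch of the two regressions and the comparison table, using the illustrative ovb_data simulated above (the coefficients printed below come from the course's own simulation, so they will not match this sketch exactly):

# Biased model (ability omitted) vs. correct model (ability included)
library(knitr)
omitted_model <- lm(income ~ education, data = ovb_data)
correct_model <- lm(income ~ education + ability, data = ovb_data)
ovb_results <- data.frame(
  Model       = c("Incorrect (Omitted)", "Correct"),
  Coefficient = c(coef(omitted_model)[["education"]], coef(correct_model)[["education"]]),
  Std_Error   = c(summary(omitted_model)$coefficients["education", "Std. Error"],
                  summary(correct_model)$coefficients["education", "Std. Error"])
)
kable(ovb_results, digits = 3)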
## [1] "Omitted Variable Bias Results:"
##
##
## |Model | Coefficient| Std_Error|
## |:-------------------|-----------:|---------:|
## |Incorrect (Omitted) | 8825.637| 224.598|
## |Correct | 4746.707| 344.608|
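The gap between the two estimates is the standard omitted variable bias. When ability is left out, the OLS coefficient on education absorbs part of ability's effect in proportion to how strongly ability and education move together: \[ \operatorname{plim}\ \hat{\beta}_{educ}^{omitted} = \beta_{educ} + \beta_{ability} \cdot \frac{\operatorname{Cov}(educ, ability)}{\operatorname{Var}(educ)} \] Since the simulation presumably builds in a positive ability effect on income and a positive education-ability correlation, the bias is positive, which is why the "Incorrect (Omitted)" estimate (8825.6) sits well above the "Correct" one (4746.7).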
Does exercising make people happier?
If we look at the data, exercise certainly appears to be positively correlated with happiness. Does this mean that if we took a random person off the street and increased how often they exercise, they would become happier? No!
Exercise may be endogenous: Happy people may exercise more.
So we probably observe this positive correlation between exercise and happiness because either:
- exercise genuinely makes people happier (the causal effect we are after), or
- happier people simply exercise more (reverse causality).
How do we distinguish between these two scenarios?
Use gym membership discounts as an instrument:
# Simulate data
set.seed(456)
n <- 300
discount <- rbinom(n, 1, 0.4) # Instrument
exercise <- 3 + 2 * discount + rnorm(n, 0, 1) # Endogenous variable
happiness <- 70 + 5 * exercise + rnorm(n, 0, 10) # Outcome
exercise_data <- data.frame(discount, exercise, happiness)
# First stage
first_stage_ex <- lm(exercise ~ discount, data = exercise_data)
exercise_data$exercise_hat <- predict(first_stage_ex)
# Second stage
second_stage_ex <- lm(happiness ~ exercise_hat, data = exercise_data)

# Compare OLS vs IV
ols_model_ex <- lm(happiness ~ exercise, data = exercise_data)
results_comparison_ex <- data.frame(
  Method      = c("OLS", "IV"),
  Coefficient = c(coef(ols_model_ex)[2], coef(second_stage_ex)[2]),
  Std_Error   = c(summary(ols_model_ex)$coefficients[2, 2],
                  summary(second_stage_ex)$coefficients[2, 2])
)
kable(results_comparison_ex, digits = 3)

|             |Method | Coefficient| Std_Error|
|:------------|:------|-----------:|---------:|
|exercise     |OLS    |       4.157|     0.425|
|exercise_hat |IV     |       4.436|     0.636|
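The manual two-step procedure recovers the IV point estimate, but the second-stage lm() standard errors do not account for the fact that exercise_hat is itself estimated. As a cross-check, ivreg() from the AER package (assuming it is installed) runs two-stage least squares in one call and reports corrected standard errors:

# 2SLS in one call: outcome ~ endogenous regressor | instrument
library(AER)
iv_model_ex <- ivreg(happiness ~ exercise | discount, data = exercise_data)
summary(iv_model_ex)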
Education may be endogenous: unobserved ability may raise both years of schooling and earnings, so OLS overstates the return to education.
The instrument:
- Use distance to the nearest college as an instrument:
set.seed(789)
n <- 500
distance <- runif(n, 0, 50) # Instrument
education <- 12 - 0.1 * distance + rnorm(n, 0, 1) # Endogenous variable
earnings <- 20000 + 3000 * education + rnorm(n, 0, 5000) # Outcome
education_data <- data.frame(distance, education, earnings)
# First stage
first_stage_ed <- lm(education ~ distance, data = education_data)
education_data$education_hat <- predict(first_stage_ed)
# Second stage
second_stage_ed <- lm(earnings ~ education_hat, data = education_data)
# Visualize
p3 <- ggplot(education_data, aes(x = education, y = earnings, color = distance)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Education vs. Earnings",
       x = "Years of Education",
       y = "Earnings ($)",
       color = "Distance to College") +
  theme_minimal()
print(p3)

# Compare OLS vs IV
ols_model_ed <- lm(earnings ~ education, data = education_data)
results_comparison_ed <- data.frame(
  Method      = c("OLS", "IV"),
  Coefficient = c(coef(ols_model_ed)[2], coef(second_stage_ed)[2]),
  Std_Error   = c(summary(ols_model_ed)$coefficients[2, 2],
                  summary(second_stage_ed)$coefficients[2, 2])
)
kable(results_comparison_ed, digits = 3)

|              |Method | Coefficient| Std_Error|
|:-------------|:------|-----------:|---------:|
|education     |OLS    |    3059.843|   131.985|
|education_hat |IV     |    2932.271|   188.557|
Multicollinearity occurs when explanatory variables are highly correlated.
Example: Estimating the effect of advertising on sales with overlapping campaigns.
Generate sample data on advertising campaigns and sales:
# Simulate data
set.seed(456)
n <- 300
campaign1 <- rnorm(n, 100, 10) # Advertising campaign 1
campaign2 <- campaign1 + rnorm(n, 0, 5) # Highly correlated variable (campaign 2)
sales <- 5000 + 20 * campaign1 + 15 * campaign2 + rnorm(n, 0, 1000) # Outcome
multi_data <- data.frame(campaign1, campaign2, sales)

# Model with multicollinearity
multi_model <- lm(sales ~ campaign1 + campaign2, data = multi_data)
beta2 <- coef(multi_model)[2]
beta3 <- coef(multi_model)[3]

##
##
## | | vif_values|
## |:---------|----------:|
## |campaign1 | 4.918|
## |campaign2 | 4.918|
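The code that produced the VIF table above is not shown; it can be reproduced with vif() from the car package, roughly as follows (assuming car and knitr are available):

# Variance inflation factors for the two campaign variables
library(car)
library(knitr)
vif_values <- vif(multi_model)
kable(data.frame(vif_values), digits = 3)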
Results:
## [1] "\nMulticollinearity Model Summary:"
##
## Call:
## lm(formula = sales ~ campaign1 + campaign2, data = multi_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2943.79 -677.20 -53.97 700.33 2986.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4741.355 615.624 7.702 2.02e-13 ***
## campaign1 33.903 13.507 2.510 0.0126 *
## campaign2 3.567 12.405 0.288 0.7739
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1031 on 297 degrees of freedom
## Multiple R-squared: 0.1127, Adjusted R-squared: 0.1067
## F-statistic: 18.86 on 2 and 297 DF, p-value: 1.94e-08
Variance Inflation Factor (VIF) is a measure that detects multicollinearity in regression analysis.
Multicollinearity occurs when there are high correlations between independent variables, which can lead to unreliable and unstable estimates of regression coefficients.
VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity. The VIF for variable \(j\) is: \[ VIF_j = \frac{1}{1-R_j^2} \]
where \(R_j^2\) is the R-squared obtained by regressing the j-th predictor on all other predictors.

Interpretation of VIF values:
- VIF = 1: no correlation between this independent variable and the others
- VIF < 5: generally acceptable
- VIF > 5 or 10: problematic multicollinearity (some use 5 as the cutoff, others use 10)
The higher the VIF, the more severe the multicollinearity.
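To connect the formula to the example, the VIF for campaign1 can be computed by hand from the auxiliary regression of campaign1 on the other predictor (a sketch using the simulated multi_data from above):

# Auxiliary regression of campaign1 on campaign2, then apply the VIF formula
aux_model  <- lm(campaign1 ~ campaign2, data = multi_data)
r2_aux     <- summary(aux_model)$r.squared
vif_manual <- 1 / (1 - r2_aux)
vif_manual   # Should be close to the vif() value of about 4.9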
Measurement error occurs when a variable is recorded imprecisely, so the values we observe differ from the true values.
Example: Estimating the effect of hours studied on test scores with noisy data.
# Simulate data
set.seed(789)
n <- 400
true_hours <- rnorm(n, 10, 2) # True hours studied
measured_hours <- true_hours + rnorm(n, 0, 1) # Measured with error
test_scores <- 50 + 5 * true_hours + rnorm(n, 0, 5) # Outcome
measurement_data <- data.frame(measured_hours, test_scores)

# Model with measurement error
measurement_model <- lm(test_scores ~ measured_hours, data = measurement_data)

Results:
##
##
## | |Model | Coefficient| Std_Error|
## |:--------------|:--------|-----------:|---------:|
## |measured_hours |Measured | 4.214| 0.150|
## |true_hours |True | 5.050| 0.127|
# Simulate data
set.seed(123)
n <- 300
true_hours <- rnorm(n, 10, 2) # True hours studied
measured_hours <- true_hours + rnorm(n, 0, 1) # Measured with error
test_scores <- 50 + 5 * true_hours + rnorm(n, 0, 5) # Outcome
measurement_data <- data.frame(true_hours, measured_hours, test_scores)
# Models
true_model <- lm(test_scores ~ true_hours, data = measurement_data)
measured_model <- lm(test_scores ~ measured_hours, data = measurement_data)

##
##
## | |Model | Coefficient| Std_Error|
## |:--------------|:--------------|-----------:|---------:|
## |true_hours |True Hours | 4.877| 0.158|
## |measured_hours |Measured Hours | 3.853| 0.192|
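In both simulations the coefficient on measured hours is pulled toward zero relative to the true-hours coefficient. This is the classical attenuation (errors-in-variables) result: \[ \operatorname{plim}\ \hat{\beta}_{measured} = \beta \cdot \frac{\sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_u} \] where \(\sigma^2_{x^*}\) is the variance of the true regressor and \(\sigma^2_u\) is the variance of the measurement noise. A quick check with the variances assumed in the simulation:

# Attenuation factor implied by the simulation: Var(true) / (Var(true) + Var(error))
var_true  <- 2^2   # true_hours has sd 2
var_error <- 1^2   # measurement noise has sd 1
lambda <- var_true / (var_true + var_error)
lambda       # 0.8
lambda * 5   # Expected coefficient on measured hours, roughly 4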