Estimation Pitfalls in Econometrics

EC3133

Estimation Pitfalls

Why Do Estimators Fail?

Key Assumptions

  1. Linearity: The relationship between variables is linear.
  2. Exogeneity: The explanatory variables are uncorrelated with the error term.
  3. Homoskedasticity: The variance of the error term is constant.
  4. No Multicollinearity: Explanatory variables are not perfectly correlated.
  5. No Measurement Error: Variables are measured accurately.

What Are Misleading Inferences?

Why Does It Matter?

  1. Misleading inferences can arise from various econometric issues, including omitted variable bias, reverse causality, and multicollinearity.
  2. Always test assumptions and use appropriate methods (e.g., instrumental variables, diagnostics) to address these issues.
  3. Robust econometric analysis requires careful consideration of data quality and model specification.

Pitfall 1: Endogeneity

Omitted Variable Bias

Application: Education and Income

Generate sample data on education, income and ability:
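The simulation chunk itself is not shown on the slide. A minimal sketch of how such data could be generated, where the seed and coefficient values are illustrative assumptions rather than the ones behind the printed results:

# Simulate data (illustrative values; the original generating chunk is hidden)
set.seed(123)
n <- 500
ability <- rnorm(n, 100, 15)                      # Unobserved ability
education <- 5 + 0.08 * ability + rnorm(n, 0, 2)  # Education rises with ability
income <- 10000 + 4000 * education + 500 * ability + rnorm(n, 0, 10000) # Outcome

omitted_data <- data.frame(education, income, ability)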

Estimation

Using the generated data, let’s first estimate the effect of education on income without controlling for ability:

# Incorrect model (omitting ability)
incorrect_model <- lm(income ~ education, data = omitted_data)

Now do the same, but this time include both education and ability as regressors:

# Correct model (including ability)
correct_model <- lm(income ~ education + ability, data = data.frame(education, income, ability))

Compare results
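The code behind this comparison is not shown; it was presumably assembled along the same lines as the later OLS-vs-IV comparisons, roughly:

# Compare the education coefficient with and without controlling for ability
library(knitr)
results_ovb <- data.frame(
  Model = c("Incorrect (Omitted)", "Correct"),
  Coefficient = c(coef(incorrect_model)[2], coef(correct_model)[2]),
  Std_Error = c(summary(incorrect_model)$coefficients[2, 2],
                summary(correct_model)$coefficients[2, 2])
)
print("Omitted Variable Bias Results:")
kable(results_ovb, digits = 3)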

## [1] "Omitted Variable Bias Results:"
## 
## 
## |Model               | Coefficient| Std_Error|
## |:-------------------|-----------:|---------:|
## |Incorrect (Omitted) |    8825.637|   224.598|
## |Correct             |    4746.707|   344.608|

Visualize results

A Solution to Endogeneity Problems: Instrumental Variables

How Does IV Work? Two-Stage Least Squares (2SLS)

  1. First Stage: Regress the endogenous variable on the instrument(s): \[ X = \pi_0 + \pi_1 Z + \epsilon \]
    • This isolates the variation in \(X\) that is driven by the instrument and therefore uncorrelated with the error term.
  2. Second Stage: Regress the dependent variable on the predicted values of \(X\): \[ Y = \beta_0 + \beta_1 \hat{X} + u \]
    • This gives a consistent estimate of \(\beta_1\).

Application: Exercise and Happiness

The Endogeneity Problem

Using an instrument to address endogeneity

# Simulate data
set.seed(456)
n <- 300
discount <- rbinom(n, 1, 0.4) # Instrument
exercise <- 3 + 2 * discount + rnorm(n, 0, 1) # Endogenous variable
happiness <- 70 + 5 * exercise + rnorm(n, 0, 10) # Outcome

exercise_data <- data.frame(discount, exercise, happiness)

# First stage
first_stage_ex <- lm(exercise ~ discount, data = exercise_data)
exercise_data$exercise_hat <- predict(first_stage_ex)

# Second stage
second_stage_ex <- lm(happiness ~ exercise_hat, data = exercise_data)
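One caveat: running the second stage by hand with lm() produces standard errors that ignore the estimation error in the first stage, so they are not the correct 2SLS standard errors. A dedicated routine handles this automatically; a sketch using ivreg() from the AER package (assuming it is installed):

# 2SLS in one step: regressors before "|", instruments after
library(AER)
iv_model_ex <- ivreg(happiness ~ exercise | discount, data = exercise_data)
summary(iv_model_ex) # same coefficient as the manual second stage, corrected SEs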

Visualization

Compare OLS vs IV

# Compare OLS vs IV
ols_model_ex <- lm(happiness ~ exercise, data = exercise_data)
results_comparison_ex <- data.frame(
  Method = c("OLS", "IV"),
  Coefficient = c(coef(ols_model_ex)[2], coef(second_stage_ex)[2]),
  Std_Error = c(summary(ols_model_ex)$coefficients[2,2],
                summary(second_stage_ex)$coefficients[2,2])
)
kable(results_comparison_ex, digits = 3)
|             |Method | Coefficient| Std_Error|
|:------------|:------|-----------:|---------:|
|exercise     |OLS    |       4.157|     0.425|
|exercise_hat |IV     |       4.436|     0.636|

Application: Education and Income

Using an instrument to address endogeneity

The Instrument: use distance to the nearest college as an instrument for education:

set.seed(789)
n <- 500
distance <- runif(n, 0, 50) # Instrument
education <- 12 + -0.1 * distance + rnorm(n, 0, 1) # Endogenous variable
earnings <- 20000 + 3000 * education + rnorm(n, 0, 5000) # Outcome

education_data <- data.frame(distance, education, earnings)

# First stage
first_stage_ed <- lm(education ~ distance, data = education_data)
education_data$education_hat <- predict(first_stage_ed)

# Second stage
second_stage_ed <- lm(earnings ~ education_hat, data = education_data)

# Visualize
p3 <- ggplot(education_data, aes(x = education, y = earnings, color = distance)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Education vs. Earnings",
       x = "Years of Education",
       y = "Earnings ($)",
       color = "Distance to College") +
  theme_minimal()
print(p3)
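Before relying on the IV estimate, it is worth checking that distance actually predicts education (instrument relevance). A quick check using the first-stage fit above:

# Instrument relevance: with a single instrument, the first-stage F-statistic
# should be comfortably large (a common rule of thumb is F > 10)
summary(first_stage_ed)$fstatistic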

Compare OLS vs IV

# Compare OLS vs IV
ols_model_ed <- lm(earnings ~ education, data = education_data)
results_comparison_ed <- data.frame(
  Method = c("OLS", "IV"),
  Coefficient = c(coef(ols_model_ed)[2], coef(second_stage_ed)[2]),
  Std_Error = c(summary(ols_model_ed)$coefficients[2,2],
                summary(second_stage_ed)$coefficients[2,2])
)
kable(results_comparison_ed, digits = 3)
|              |Method | Coefficient| Std_Error|
|:-------------|:------|-----------:|---------:|
|education     |OLS    |    3059.843|   131.985|
|education_hat |IV     |    2932.271|   188.557|

Pitfall 2: Multicollinearity (Advertising and Sales)

Generate sample data on advertising campaigns and sales:

# Simulate data
set.seed(456)
n <- 300
campaign1 <- rnorm(n, 100, 10) # Advertising campaign 1
campaign2 <- campaign1 + rnorm(n, 0, 5) # Highly correlated variable (campaign 2)
sales <- 5000 + 20 * campaign1 + 15 * campaign2 + rnorm(n, 0, 1000) # Outcome

multi_data <- data.frame(campaign1, campaign2, sales)
# Model with multicollinearity
multi_model <- lm(sales ~ campaign1 + campaign2, data = multi_data)

beta2 <- coef(multi_model)[2]
beta3 <- coef(multi_model)[3]
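The VIF table below was presumably computed with vif() from the car package; a sketch of that step:

# Variance inflation factors for the two campaign variables
library(car)
vif_values <- vif(multi_model)
kable(data.frame(vif_values), digits = 3)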
## 
## 
## |          | vif_values|
## |:---------|----------:|
## |campaign1 |      4.918|
## |campaign2 |      4.918|

Results:

## [1] "\nMulticollinearity Model Summary:"
## 
## Call:
## lm(formula = sales ~ campaign1 + campaign2, data = multi_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2943.79  -677.20   -53.97   700.33  2986.17 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4741.355    615.624   7.702 2.02e-13 ***
## campaign1     33.903     13.507   2.510   0.0126 *  
## campaign2      3.567     12.405   0.288   0.7739    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1031 on 297 degrees of freedom
## Multiple R-squared:  0.1127, Adjusted R-squared:  0.1067 
## F-statistic: 18.86 on 2 and 297 DF,  p-value: 1.94e-08

Variance Inflation Factor (VIF)

The VIF for the j-th predictor is \[ \text{VIF}_j = \frac{1}{1 - R^2_j} \] where \(R^2_j\) is the R-squared value obtained by regressing the j-th predictor on all other predictors.

Interpretation of VIF values:

  • VIF = 1: no correlation between this independent variable and the others.
  • VIF < 5: generally acceptable.
  • VIF > 5 or 10: problematic multicollinearity (some use 5 as the cutoff, others 10).

The higher the VIF, the more severe the multicollinearity.
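Common remedies include dropping one of the collinear regressors or combining them into a single index. A minimal sketch using the simulated campaign data (the combined variable is an illustrative choice, not from the original analysis):

# One simple remedy: combine the two near-duplicate campaign measures
multi_data$campaign_total <- multi_data$campaign1 + multi_data$campaign2
combined_model <- lm(sales ~ campaign_total, data = multi_data)
summary(combined_model)$coefficients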

Pitfall 3: Measurement Error

# Simulate data
set.seed(789)
n <- 400
true_hours <- rnorm(n, 10, 2) # True hours studied
measured_hours <- true_hours + rnorm(n, 0, 1) # Measured with error
test_scores <- 50 + 5 * true_hours + rnorm(n, 0, 5) # Outcome

measurement_data <- data.frame(true_hours, measured_hours, test_scores)
# Models with and without measurement error (the comparison below uses both)
true_model <- lm(test_scores ~ true_hours, data = measurement_data)
measurement_model <- lm(test_scores ~ measured_hours, data = measurement_data)

Results:

Compare true vs. measured coefficients:

## 
## 
## |               |Model    | Coefficient| Std_Error|
## |:--------------|:--------|-----------:|---------:|
## |measured_hours |Measured |       4.214|     0.150|
## |true_hours     |True     |       5.050|     0.127|

Visualization of Measurement Error

More on Measurement Error: What is it?

  1. Measurement error can severely bias estimates and reduce statistical power.
  2. Classical measurement error leads to attenuation bias, while systematic error can cause unpredictable bias.
  3. Always assess the quality of your data and consider methods to correct for measurement error (e.g., instrumental variables).

Intuition: Why Does It Matter?

Types of Measurement Error

Classical Measurement Error
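For classical measurement error, where the regressor is observed with mean-zero noise that is independent of everything else, the textbook result is attenuation toward zero:

\[ \text{plim}\, \hat{\beta}_1 = \beta_1 \frac{\sigma_X^2}{\sigma_X^2 + \sigma_u^2} \]

where \(\sigma_X^2\) is the variance of the true regressor and \(\sigma_u^2\) is the variance of the measurement error: the noisier the measurement, the more the estimated slope shrinks toward zero.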

Systematic Measurement Error

Example: Hours Studied and Test Scores

# Simulate data
set.seed(123)
n <- 300
true_hours <- rnorm(n, 10, 2) # True hours studied
measured_hours <- true_hours + rnorm(n, 0, 1) # Measured with error
test_scores <- 50 + 5 * true_hours + rnorm(n, 0, 5) # Outcome

measurement_data <- data.frame(true_hours, measured_hours, test_scores)

# Models
true_model <- lm(test_scores ~ true_hours, data = measurement_data)
measured_model <- lm(test_scores ~ measured_hours, data = measurement_data)

Compare

## 
## 
## |               |Model          | Coefficient| Std_Error|
## |:--------------|:--------------|-----------:|---------:|
## |true_hours     |True Hours     |       4.877|     0.158|
## |measured_hours |Measured Hours |       3.853|     0.192|
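One correction mentioned above is instrumental variables: a second, independently mis-measured report of study hours can instrument for the first, because its error is uncorrelated with the first measurement's error. A hedged sketch continuing the simulation above (the second measurement is invented here for illustration):

# IV correction for classical measurement error:
# use a second noisy measurement as an instrument for the first
library(AER)
measurement_data$measured_hours2 <- true_hours + rnorm(n, 0, 1)
iv_measurement <- ivreg(test_scores ~ measured_hours | measured_hours2,
                        data = measurement_data)
coef(iv_measurement)[2] # should land close to the true slope of 5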

Things to Remember

  1. Econometric estimators are powerful but rely on strong assumptions.
  2. Violating assumptions can lead to biased, inconsistent, or inefficient estimates.
  3. Always test assumptions and use diagnostics to validate your models.