EC3133
Omitted Variable Bias
Generate sample data on education, income and ability:
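The data-generating code is not reproduced here; below is a minimal sketch of how such data could be simulated, assuming unobserved ability raises both education and income. The parameter values and seed are illustrative assumptions, not necessarily those behind the results table further down.

# Illustrative simulation (assumed parameters): ability drives both education and income
set.seed(123)
n <- 500
ability   <- rnorm(n, 0, 1)                        # Unobserved ability
education <- 12 + 1.5 * ability + rnorm(n, 0, 2)   # Education increases with ability
income    <- 20000 + 5000 * education + 8000 * ability + rnorm(n, 0, 10000)  # Outcome
ovb_data  <- data.frame(education, ability, income)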
Using the generated data, let’s first estimate the effect of education on income without controlling for ability:
Now do the same, but this time include both education and ability as regressors:
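A sketch of the two regressions and the comparison table, using the illustrative ovb_data simulated above (the coefficients printed below come from the course's own simulation, so they will not match this sketch exactly):

# Biased model (ability omitted) vs. correct model (ability included)
library(knitr)
omitted_model <- lm(income ~ education, data = ovb_data)
correct_model <- lm(income ~ education + ability, data = ovb_data)
ovb_results <- data.frame(
  Model       = c("Incorrect (Omitted)", "Correct"),
  Coefficient = c(coef(omitted_model)[["education"]], coef(correct_model)[["education"]]),
  Std_Error   = c(summary(omitted_model)$coefficients["education", "Std. Error"],
                  summary(correct_model)$coefficients["education", "Std. Error"])
)
kable(ovb_results, digits = 3)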
## [1] "Omitted Variable Bias Results:"
##
##
## |Model | Coefficient| Std_Error|
## |:-------------------|-----------:|---------:|
## |Incorrect (Omitted) | 8825.637| 224.598|
## |Correct | 4746.707| 344.608|
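The gap between the two estimates is the standard omitted variable bias. When ability is left out, the OLS coefficient on education absorbs part of ability's effect in proportion to how strongly ability and education move together: \[ \operatorname{plim}\ \hat{\beta}_{educ}^{omitted} = \beta_{educ} + \beta_{ability} \cdot \frac{\operatorname{Cov}(educ, ability)}{\operatorname{Var}(educ)} \] Since the simulation presumably builds in a positive ability effect on income and a positive education-ability correlation, the bias is positive, which is why the "Incorrect (Omitted)" estimate (8825.6) sits well above the "Correct" one (4746.7).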
Does exercising make people happier?
If we look at the data, exercise certainly appears to be positively correlated with happiness. Does this mean that if we took a random person off the street and increased how often they exercise, they would become happier? No!
Exercise may be endogenous: Happy people may exercise more.
So we probably observe this positive correlation between exercise and happiness because either:
- exercise genuinely makes people happier (the causal effect we are after), or
- happier people simply exercise more (reverse causality).
How do we distinguish between these two scenarios?
Use gym membership discounts as an instrument:
# Simulate data
set.seed(456)
n <- 300
discount <- rbinom(n, 1, 0.4) # Instrument
exercise <- 3 + 2 * discount + rnorm(n, 0, 1) # Endogenous variable
happiness <- 70 + 5 * exercise + rnorm(n, 0, 10) # Outcome
exercise_data <- data.frame(discount, exercise, happiness)
# First stage
first_stage_ex <- lm(exercise ~ discount, data = exercise_data)
exercise_data$exercise_hat <- predict(first_stage_ex)
# Second stage
second_stage_ex <- lm(happiness ~ exercise_hat, data = exercise_data)

# Compare OLS vs IV
ols_model_ex <- lm(happiness ~ exercise, data = exercise_data)
results_comparison_ex <- data.frame(
  Method      = c("OLS", "IV"),
  Coefficient = c(coef(ols_model_ex)[2], coef(second_stage_ex)[2]),
  Std_Error   = c(summary(ols_model_ex)$coefficients[2, 2],
                  summary(second_stage_ex)$coefficients[2, 2])
)
kable(results_comparison_ex, digits = 3)

|             |Method | Coefficient| Std_Error|
|:------------|:------|-----------:|---------:|
|exercise     |OLS    |       4.157|     0.425|
|exercise_hat |IV     |       4.436|     0.636|
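The manual two-step procedure recovers the IV point estimate, but the second-stage lm() standard errors do not account for the fact that exercise_hat is itself estimated. As a cross-check, ivreg() from the AER package (assuming it is installed) runs two-stage least squares in one call and reports corrected standard errors:

# 2SLS in one call: outcome ~ endogenous regressor | instrument
library(AER)
iv_model_ex <- ivreg(happiness ~ exercise | discount, data = exercise_data)
summary(iv_model_ex)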
Education may be endogenous: unobserved ability may raise both years of schooling and earnings, so OLS overstates the return to education.
The instrument:
- Use distance to the nearest college as an instrument:
set.seed(789)
n <- 500
distance <- runif(n, 0, 50) # Instrument
education <- 12 - 0.1 * distance + rnorm(n, 0, 1) # Endogenous variable
earnings <- 20000 + 3000 * education + rnorm(n, 0, 5000) # Outcome
education_data <- data.frame(distance, education, earnings)
# First stage
first_stage_ed <- lm(education ~ distance, data = education_data)
education_data$education_hat <- predict(first_stage_ed)
# Second stage
second_stage_ed <- lm(earnings ~ education_hat, data = education_data)
# Visualize
p3 <- ggplot(education_data, aes(x = education, y = earnings, color = distance)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Education vs. Earnings",
       x = "Years of Education",
       y = "Earnings ($)",
       color = "Distance to College") +
  theme_minimal()
print(p3)

# Compare OLS vs IV
ols_model_ed <- lm(earnings ~ education, data = education_data)
results_comparison_ed <- data.frame(
  Method      = c("OLS", "IV"),
  Coefficient = c(coef(ols_model_ed)[2], coef(second_stage_ed)[2]),
  Std_Error   = c(summary(ols_model_ed)$coefficients[2, 2],
                  summary(second_stage_ed)$coefficients[2, 2])
)
kable(results_comparison_ed, digits = 3)

|              |Method | Coefficient| Std_Error|
|:-------------|:------|-----------:|---------:|
|education     |OLS    |    3059.843|   131.985|
|education_hat |IV     |    2932.271|   188.557|
Multicollinearity occurs when explanatory variables are highly correlated.
Example: Estimating the effect of advertising on sales with overlapping campaigns.
Generate sample data on advertising campaigns and sales:
# Simulate data
set.seed(456)
n <- 300
campaign1 <- rnorm(n, 100, 10) # Advertising campaign 1
campaign2 <- campaign1 + rnorm(n, 0, 5) # Highly correlated variable (campaign 2)
sales <- 5000 + 20 * campaign1 + 15 * campaign2 + rnorm(n, 0, 1000) # Outcome
multi_data <- data.frame(campaign1, campaign2, sales)

# Model with multicollinearity
multi_model <- lm(sales ~ campaign1 + campaign2, data = multi_data)
beta2 <- coef(multi_model)[2]
beta3 <- coef(multi_model)[3]

##
##
## | | vif_values|
## |:---------|----------:|
## |campaign1 | 4.918|
## |campaign2 | 4.918|
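The code that produced the VIF table above is not shown; it can be reproduced with vif() from the car package, roughly as follows (assuming car and knitr are available):

# Variance inflation factors for the two campaign variables
library(car)
library(knitr)
vif_values <- vif(multi_model)
kable(data.frame(vif_values), digits = 3)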
Results:
## [1] "\nMulticollinearity Model Summary:"
##
## Call:
## lm(formula = sales ~ campaign1 + campaign2, data = multi_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2943.79 -677.20 -53.97 700.33 2986.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4741.355 615.624 7.702 2.02e-13 ***
## campaign1 33.903 13.507 2.510 0.0126 *
## campaign2 3.567 12.405 0.288 0.7739
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1031 on 297 degrees of freedom
## Multiple R-squared: 0.1127, Adjusted R-squared: 0.1067
## F-statistic: 18.86 on 2 and 297 DF, p-value: 1.94e-08
Variance Inflation Factor (VIF) is a measure that detects multicollinearity in regression analysis.
Multicollinearity occurs when there are high correlations between independent variables, which can lead to unreliable and unstable estimates of regression coefficients.
VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity. The VIF for variable \(j\) is: \[ VIF_j = \frac{1}{1-R_j^2} \]
where \(R_j^2\) is the R-squared obtained by regressing the j-th predictor on all other predictors.

Interpretation of VIF values:
- VIF = 1: no correlation between this independent variable and the others
- VIF < 5: generally acceptable
- VIF > 5 or 10: problematic multicollinearity (some use 5 as the cutoff, others use 10)
The higher the VIF, the more severe the multicollinearity.
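To connect the formula to the example, the VIF for campaign1 can be computed by hand from the auxiliary regression of campaign1 on the other predictor (a sketch using the simulated multi_data from above):

# Auxiliary regression of campaign1 on campaign2, then apply the VIF formula
aux_model  <- lm(campaign1 ~ campaign2, data = multi_data)
r2_aux     <- summary(aux_model)$r.squared
vif_manual <- 1 / (1 - r2_aux)
vif_manual   # Should be close to the vif() value of about 4.9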
Measurement error occurs when a variable is recorded imprecisely, so the values we observe differ from the true values.
Example: Estimating the effect of hours studied on test scores with noisy data.
# Simulate data
set.seed(789)
n <- 400
true_hours <- rnorm(n, 10, 2) # True hours studied
measured_hours <- true_hours + rnorm(n, 0, 1) # Measured with error
test_scores <- 50 + 5 * true_hours + rnorm(n, 0, 5) # Outcome
measurement_data <- data.frame(measured_hours, test_scores)

# Model with measurement error
measurement_model <- lm(test_scores ~ measured_hours, data = measurement_data)

Results:
##
##
## | |Model | Coefficient| Std_Error|
## |:--------------|:--------|-----------:|---------:|
## |measured_hours |Measured | 4.214| 0.150|
## |true_hours |True | 5.050| 0.127|
# Simulate data
set.seed(123)
n <- 300
true_hours <- rnorm(n, 10, 2) # True hours studied
measured_hours <- true_hours + rnorm(n, 0, 1) # Measured with error
test_scores <- 50 + 5 * true_hours + rnorm(n, 0, 5) # Outcome
measurement_data <- data.frame(true_hours, measured_hours, test_scores)
# Models
true_model <- lm(test_scores ~ true_hours, data = measurement_data)
measured_model <- lm(test_scores ~ measured_hours, data = measurement_data)

##
##
## | |Model | Coefficient| Std_Error|
## |:--------------|:--------------|-----------:|---------:|
## |true_hours |True Hours | 4.877| 0.158|
## |measured_hours |Measured Hours | 3.853| 0.192|
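In both simulations the coefficient on measured hours is pulled toward zero relative to the true-hours coefficient. This is the classical attenuation (errors-in-variables) result: \[ \operatorname{plim}\ \hat{\beta}_{measured} = \beta \cdot \frac{\sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_u} \] where \(\sigma^2_{x^*}\) is the variance of the true regressor and \(\sigma^2_u\) is the variance of the measurement noise. A quick check with the variances assumed in the simulation:

# Attenuation factor implied by the simulation: Var(true) / (Var(true) + Var(error))
var_true  <- 2^2   # true_hours has sd 2
var_error <- 1^2   # measurement noise has sd 1
lambda <- var_true / (var_true + var_error)
lambda       # 0.8
lambda * 5   # Expected coefficient on measured hours, roughly 4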