Econometric Insights (Part I)

EC3133

The Big Picture

Understanding Empirical Analysis:

What we will see:

Understanding Covariance

Covariance Formula

The Math

\[ Cov(X,Y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n} \]
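As a minimal sketch of the formula above, the covariance can be computed by hand and compared with R's built-in `cov()`. The vectors `x` and `y` are made-up illustration data, not course data. Note that `cov()` divides by n - 1 rather than n, so the two numbers differ by the factor (n - 1)/n.

```r
# Hypothetical illustration data (not from the course material)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Covariance exactly as in the formula: average product of deviations (divide by n)
cov_n <- sum((x - mean(x)) * (y - mean(y))) / n    # 1.2

# R's built-in cov() uses the sample version (divide by n - 1)
cov_builtin <- cov(x, y)                           # 1.5

# The two versions differ only by the factor (n - 1) / n
print(all.equal(cov_n, cov_builtin * (n - 1) / n))
```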

In English:

From Covariance to Beta

The Formula and Why It Makes Sense

\[ \beta_1 = \frac{Cov(X,Y)}{Var(X)} \]
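A short sketch, using simulated data (the variable names and the true slope of 3 are assumptions for illustration), suggests how the ratio above reproduces the slope reported by `lm()`. Whether covariance and variance divide by n or by n - 1 does not matter here, because the factor cancels in the ratio.

```r
set.seed(1)

# Simulated data: the true slope is set to 3 for illustration only
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

# Slope as covariance divided by variance
beta_ratio <- cov(x, y) / var(x)

# Slope as estimated by lm()
beta_lm <- unname(coef(lm(y ~ x))["x"])

# Both routes should give the same number (up to floating-point error)
print(c(ratio = beta_ratio, lm = beta_lm))
```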

The Same Thing in Matrix Form

OLS Formula:

\[ \beta = (X'X)^{-1}X'y \]
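The matrix formula can be checked directly in base R. The sketch below uses simulated data with two hypothetical regressors `x1` and `x2`; it builds the design matrix by hand (including a column of ones for the intercept) and compares the result with `lm()`.

```r
set.seed(2)
n <- 50

# Simulated regressors and outcome (coefficients chosen for illustration)
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix: a column of ones for the intercept, then the regressors
X <- cbind(1, x1, x2)

# (X'X)^{-1} X'y, solved as a linear system rather than via an explicit inverse
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with lm()
beta_lm <- coef(lm(y ~ x1 + x2))
print(cbind(matrix_formula = as.numeric(beta_hat), lm = as.numeric(beta_lm)))
```

Solving the system with `solve(A, b)` is generally preferred numerically to computing the inverse and multiplying, although both follow the same formula.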

This is Just:

Geometric View of Regression

Why Projection?

Key Insights:

The Projection Matrix

The Matrix Formula:

\[ P = X(X'X)^{-1}X' \]
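A minimal sketch, on simulated data, of what the projection matrix does: applying P to y gives the fitted values, applying it twice changes nothing, and the residuals are orthogonal to every column of X. All variable names here are illustrative assumptions.

```r
set.seed(3)
n <- 20

# Simulated data for a simple regression
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)

# Projection matrix P = X (X'X)^{-1} X'
P <- X %*% solve(crossprod(X)) %*% t(X)

# P maps y onto the column space of X: these are the fitted values
y_hat <- as.numeric(P %*% y)
res <- y - y_hat

# Projecting twice is the same as projecting once (idempotent)
idempotent_gap <- max(abs(P %*% P - P))

# P is symmetric
symmetry_gap <- max(abs(P - t(P)))

# Residuals are orthogonal to the columns of X
orthogonality_gap <- max(abs(crossprod(X, res)))

# All three gaps should be zero up to floating-point error
print(c(idempotent_gap, symmetry_gap, orthogonality_gap))

# Fitted values agree with lm()
print(all.equal(y_hat, as.numeric(fitted(lm(y ~ x)))))
```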

What It Does:

Visualizing Residuals

The Complete Picture

Regression Is:

  1. Finding how variables move together (Covariance)
  2. Adjusting for individual movement (Variance)
  3. Projecting onto the best fitting line/plane

All These Views Are Equivalent:

Practical Example: Height vs Weight

The Big Question

How do we know what control variables we should include in the estimation equation?

Choosing Control Variables in Multivariate Regression: A Practical Guide

The Problem with Over-Controlling

Key Issues:

  1. Multicollinearity: Highly correlated controls inflate standard errors and make individual coefficients unstable
  2. Overfitting: Too many controls can make your model fit the noise in your sample rather than the underlying relationship
  3. Bias Introduction: Controlling for variables that are outcomes of your treatment can bias your estimates
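The first issue can be illustrated with a small simulation. In this sketch, `z` is a hypothetical control that is almost a copy of `x`; adding it should leave the coefficient on `x` unbiased but make its standard error much larger. The data and the noise levels are assumptions chosen to make the effect visible.

```r
set.seed(4)
n <- 200

# x is the variable of interest; z is a hypothetical, nearly redundant control
x <- rnorm(n)
z <- x + rnorm(n, sd = 0.05)
y <- 1 + 2 * x + rnorm(n)

# Standard error of the x coefficient without and with the redundant control
se_alone <- summary(lm(y ~ x))$coefficients["x", "Std. Error"]
se_with_z <- summary(lm(y ~ x + z))$coefficients["x", "Std. Error"]

# The standard error with z included should be many times larger
print(c(alone = se_alone, with_z = se_with_z))
```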

Example: Studying Exercise and Weight Loss

Scenario:

Question:

Visualizing the Problem

Should You Control for Calories?

Key Considerations:

  1. Is it a confounder?
    • Does it affect both exercise and weight loss?
  2. Is it a mediator?
    • Is it part of the causal pathway from exercise to weight loss?

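Answer: In this example calorie intake is a mediator (exercise changes calorie intake, which in turn changes weight loss), so do not control for it if you want the total effect of exercise.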

The DAG Approach

Directed Acyclic Graphs (DAGs):

Practical Steps

Step 1: Identify Confounders

Step 2: Avoid Controlling for Mediators

Step 3: Be Careful with Colliders
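Why colliders need care can be sketched with simulated data. Here `x` has no effect on `y` at all, and `collider` is a hypothetical variable caused by both. Leaving it out gives a coefficient near zero; controlling for it creates a spurious negative association (about -0.5 in theory for this particular setup).

```r
set.seed(5)
n <- 5000

# x and y are generated independently: the true effect of x on y is zero
x <- rnorm(n)
y <- rnorm(n)

# A hypothetical collider: caused by both x and y
collider <- x + y + rnorm(n)

# Without the collider, the estimate is close to the truth (zero)
b_without <- unname(coef(lm(y ~ x))["x"])

# Controlling for the collider opens a spurious path between x and y
b_with <- unname(coef(lm(y ~ x + collider))["x"])

print(c(without_collider = b_without, with_collider = b_with))
```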

Example: Studying Education and Income

Scenario:

Question:

Visualizing the Relationships

Should You Control for IQ?

Key Considerations:

  1. Is it a confounder?
    • Does it affect both education and income?
  2. Is it a mediator?
    • Is it part of the causal pathway from education to income?
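To see what is at stake in the first question, consider a simulation in which IQ is assumed to be a confounder, affecting both education and income. The coefficients below (a true education effect of 2) are illustrative assumptions, not estimates from real data. Under these assumptions, omitting IQ overstates the effect of education, and controlling for it recovers the true value.

```r
set.seed(6)
n <- 5000

# Assumed data-generating process: IQ drives both education and income
iq <- rnorm(n)
education <- 0.8 * iq + rnorm(n)
income <- 2 * education + 3 * iq + rnorm(n)   # true education effect is 2

# Omitting the confounder: education picks up part of the IQ effect
b_naive <- unname(coef(lm(income ~ education))["education"])

# Controlling for the confounder: estimate should be close to 2
b_adjusted <- unname(coef(lm(income ~ education + iq))["education"])

print(c(naive = b_naive, adjusted = b_adjusted))
```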

Answer:

The meaning of “Mediator”

Analyze the data

Using the weight loss example, let’s look more carefully into what a mediator means.

The Causal Chain:

Exercise → Calorie Intake → Weight Loss

Exercise also directly affects Weight Loss

We will fit two models to the weight loss data: one without controlling for the mediator (calories) and one with it.

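# Total effect: regress weight loss on exercise only (calories left out)
model_no_control <- lm(weight_loss ~ exercise, data = data_mediator)
summary_no_control <- summary(model_no_control)
# Direct effect: add the mediator (calories) as a control
model_with_control <- lm(weight_loss ~ exercise + calories, data = data_mediator)
summary_with_control <- summary(model_with_control)
# Print both summaries as a named list
list(no_control = summary_no_control, with_control = summary_with_control)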

Regression results WITHOUT and WITH controlling for the mediator (calories):

## $no_control
## 
## Call:
## lm(formula = weight_loss ~ exercise, data = data_mediator)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72024 -0.74405 -0.04747  0.64365  2.62107 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -19.54552    0.31834   -61.4   <2e-16 ***
## exercise      2.94591    0.05798    50.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.053 on 98 degrees of freedom
## Multiple R-squared:  0.9634, Adjusted R-squared:  0.963 
## F-statistic:  2581 on 1 and 98 DF,  p-value: < 2.2e-16
## 
## 
## $with_control
## 
## Call:
## lm(formula = weight_loss ~ exercise + calories, data = data_mediator)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8730 -0.6607 -0.1245  0.6214  2.0798 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.484457   3.973014  -0.122    0.903    
## exercise     1.981037   0.207310   9.556 1.22e-15 ***
## calories    -0.009524   0.001980  -4.810 5.52e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9513 on 97 degrees of freedom
## Multiple R-squared:  0.9705, Adjusted R-squared:  0.9699 
## F-statistic:  1594 on 2 and 97 DF,  p-value: < 2.2e-16

Interpret the data

This reveals several important insights about mediators:

Total Effect vs Direct Effect:

  • Without controlling for calories: exercise coefficient = 2.95 (the total effect)
  • Controlling for calories: exercise coefficient = 1.98 (the direct effect)
  • The difference, about 0.96, is the indirect effect. It appears because controlling for calories blocks the indirect pathway through calorie reduction.

Why This Matters:

  • If we want to know the total effect of exercise on weight loss, we should NOT control for calories.
  • The total effect (2.95) includes both:
    • Direct effect of exercise (burning calories during workout)
    • Indirect effect (reducing appetite/calorie intake)
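The split into direct and indirect effects can be verified numerically. Since `data_mediator` is not reproduced here, the sketch below simulates hypothetical data with a similar structure (the coefficients are assumptions). With OLS, the total effect equals the direct effect plus the product of two coefficients: the effect of exercise on calories and the effect of calories on weight loss.

```r
set.seed(7)
n <- 1000

# Hypothetical data: exercise lowers calorie intake, and both affect weight loss
exercise <- rnorm(n, mean = 5, sd = 2)
calories <- 2500 - 100 * exercise + rnorm(n, sd = 50)
weight_loss <- 2 * exercise - 0.01 * calories + rnorm(n)

# Total effect: mediator left out
total <- unname(coef(lm(weight_loss ~ exercise))["exercise"])

# Direct effect: mediator held fixed
fit_direct <- lm(weight_loss ~ exercise + calories)
direct <- unname(coef(fit_direct)["exercise"])
b_calories <- unname(coef(fit_direct)["calories"])

# Effect of exercise on the mediator
a_exercise <- unname(coef(lm(calories ~ exercise))["exercise"])

# Indirect effect: exercise -> calories -> weight loss
indirect <- a_exercise * b_calories

# total should equal direct + indirect (an exact in-sample identity for OLS)
print(c(total = total, direct = direct, indirect = indirect))
```

In this simulated setup the true total effect is 3 (a direct effect of 2 plus an indirect effect of 1), so the estimates should land near those values.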

Visualization of the Mediation:

Summary

How do you decide?

  1. Use DAGs to map relationships

  2. Control for confounders

  3. Avoid controlling for mediators

  4. Be cautious with colliders

Why does it matter?