EC3133
Understanding Empirical Analysis:
What we will see:
\[ Cov(X,Y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n} \]
\[ \beta_1 = \frac{Cov(X,Y)}{Var(X)} \]
\[ \beta = (X'X)^{-1}X'y \]
\[ P = X(X'X)^{-1}X' \]
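The four formulas above can be checked numerically. The following is a minimal sketch on simulated data; the variable names, sample size, and coefficients are illustrative assumptions, not part of the course material. It compares the covariance/variance ratio and the matrix formula with what `lm()` reports, and uses the projection matrix \(P\) to recover the fitted values.

```r
# Hypothetical simulated data: y depends linearly on x plus noise
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Slope as Cov(X, Y) / Var(X); the 1/n versus 1/(n - 1) factor cancels in the ratio
beta1_ratio <- cov(x, y) / var(x)

# Matrix form: beta = (X'X)^{-1} X'y, with a column of ones for the intercept
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Projection matrix P = X (X'X)^{-1} X' maps y onto the fitted values
P <- X %*% solve(t(X) %*% X) %*% t(X)
fitted_P <- as.vector(P %*% y)

# Compare with lm()
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), as.vector(beta_hat))
all.equal(unname(coef(fit)["x"]), beta1_ratio)
all.equal(unname(fitted(fit)), fitted_P)
```

Each `all.equal()` call should return `TRUE`, since all three routes compute the same least-squares fit.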
How do we know which control variables to include in the estimation equation?
Why you should NOT control for everything
How to decide what to control for
Using the weight loss example, let’s look more closely at what a mediator is: a variable that lies on the causal path from the treatment to the outcome.
The Causal Chain:
Exercise → Calorie Intake → Weight Loss
Exercise also directly affects Weight Loss
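A causal chain like this one can be simulated to see how a mediator behaves. The sketch below is one possible data-generating process; the dataset name `data_sim` and every numeric value (means, slopes, noise levels) are assumptions chosen for illustration and are not the course's actual `data_mediator`.

```r
# Hypothetical data-generating process for the chain
# Exercise -> Calorie Intake -> Weight Loss, plus a direct Exercise -> Weight Loss path
set.seed(42)
n <- 100

# Assumed units: hours of exercise per week
exercise <- rnorm(n, mean = 5, sd = 2)

# Mediator: more exercise lowers calorie intake (assumed slope of -100)
calories <- 2500 - 100 * exercise + rnorm(n, sd = 50)

# Outcome: direct effect of exercise (2) plus the effect of calories (-0.01)
weight_loss <- 2 * exercise - 0.01 * calories + rnorm(n)

data_sim <- data.frame(exercise, calories, weight_loss)
head(data_sim)
```

Under these assumed parameters, the direct effect of exercise is 2 and the indirect effect through calories is \((-100) \times (-0.01) = 1\), so the total effect is 3.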
We will fit two models to the data on weight loss: One without controlling for the mediator and one with it.
model_no_control <- lm(weight_loss ~ exercise, data = data_mediator)
summary_no_control <- summary(model_no_control)
model_with_control <- lm(weight_loss ~ exercise + calories, data = data_mediator)
summary_with_control <- summary(model_with_control)
Regression results WITH and WITHOUT controlling for the mediator (calories):
## $no_control
##
## Call:
## lm(formula = weight_loss ~ exercise, data = data_mediator)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.72024 -0.74405 -0.04747  0.64365  2.62107
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.54552    0.31834   -61.4   <2e-16 ***
## exercise      2.94591    0.05798    50.8   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.053 on 98 degrees of freedom
## Multiple R-squared: 0.9634, Adjusted R-squared: 0.963
## F-statistic: 2581 on 1 and 98 DF, p-value: < 2.2e-16
##
##
## $with_control
##
## Call:
## lm(formula = weight_loss ~ exercise + calories, data = data_mediator)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.8730 -0.6607 -0.1245  0.6214  2.0798
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.484457   3.973014  -0.122    0.903
## exercise     1.981037   0.207310   9.556 1.22e-15 ***
## calories    -0.009524   0.001980  -4.810 5.52e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9513 on 97 degrees of freedom
## Multiple R-squared: 0.9705, Adjusted R-squared: 0.9699
## F-statistic: 1594 on 2 and 97 DF, p-value: < 2.2e-16
This reveals several important insights about mediators:
Total Effect vs Direct Effect:
Without controlling for calories: Exercise coefficient = 2.95
With controlling for calories: Exercise coefficient = 1.98
The difference (about 0.97) arises because controlling for calories blocks the indirect pathway that runs through reduced calorie intake
Why This Matters:
If we want to know the total effect of exercise on weight loss, we should NOT control for calories
The total effect (2.95) includes both:
Direct effect of exercise (burning calories during workout)
Indirect effect (reducing appetite/calorie intake)
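The split into direct and indirect effects can be verified numerically. In a linear model estimated by OLS, the total effect equals the direct effect plus the product of two slopes: the effect of exercise on calories, and the effect of calories on weight loss holding exercise fixed. The sketch below checks this identity on simulated data; the data-generating values are illustrative assumptions, not the course's dataset.

```r
# Hypothetical data with a direct path and a mediated path (all numbers assumed)
set.seed(42)
n <- 100
exercise    <- rnorm(n, mean = 5, sd = 2)
calories    <- 2500 - 100 * exercise + rnorm(n, sd = 50)
weight_loss <- 2 * exercise - 0.01 * calories + rnorm(n)

# Total effect: regression without the mediator
total <- coef(lm(weight_loss ~ exercise))["exercise"]

# Direct effect and mediator slope: regression with the mediator
long   <- coef(lm(weight_loss ~ exercise + calories))
direct <- long["exercise"]

# Effect of exercise on the mediator
a_path <- coef(lm(calories ~ exercise))["exercise"]

# Indirect effect: (exercise -> calories) times (calories -> weight loss)
indirect <- a_path * long["calories"]

# The decomposition holds exactly in-sample for OLS
all.equal(unname(total), unname(direct + indirect))
```

The final comparison should return `TRUE`: dropping the mediator folds the indirect effect into the exercise coefficient, which is what you want when the target is the total effect.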
How do you decide?
Use DAGs to map relationships
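Control for confounders (variables that affect both the treatment and the outcome)
Avoid controlling for mediators when you want the total effect
Be cautious with colliders (variables affected by both the treatment and the outcome): controlling for them can create a spurious association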
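The rules for confounders and colliders can be illustrated with a small simulation. This is a sketch under assumed data-generating processes; all variable names and coefficients are hypothetical. In the first scenario, omitting a confounder biases the estimate. In the second, adding a collider creates an association between two variables that are independent by construction.

```r
set.seed(7)
n <- 5000

# Scenario 1: confounder z causes both x and y; the assumed true effect of x on y is 1
z <- rnorm(n)
x <- z + rnorm(n)
y <- 1 * x + 2 * z + rnorm(n)
b_omit <- coef(lm(y ~ x))["x"]      # confounder omitted: biased away from 1
b_ctrl <- coef(lm(y ~ x + z))["x"]  # confounder controlled: close to 1

# Scenario 2: x2 and y2 are independent, and both cause the collider c_var
x2 <- rnorm(n)
y2 <- rnorm(n)
c_var <- x2 + y2 + rnorm(n)
b_clean <- coef(lm(y2 ~ x2))["x2"]          # no control: close to 0
b_coll  <- coef(lm(y2 ~ x2 + c_var))["x2"]  # collider controlled: spuriously negative

round(c(omit_confounder = unname(b_omit),
        control_confounder = unname(b_ctrl),
        no_collider = unname(b_clean),
        control_collider = unname(b_coll)), 2)
```

In this setup, controlling for `z` removes bias while controlling for `c_var` introduces it, so adding a variable to the regression can move the estimate in either direction depending on where that variable sits in the causal graph.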
Why does it matter?
Controlling for the wrong variables can bias your results
Thoughtful selection improves the validity of your model