# Set the seed for reproducibility
set.seed(42)
# Define the sample size
n <- 10000
# Create predictor variables
X1 <- rnorm(n)
X2 <- X1 + rnorm(n, sd = 0.5)
X3 <- -2 * X1 + X2 + rnorm(n, sd = 0.5)
# Create outcome variable
Y <- 3 * X1 + 2 * X3 + rnorm(n, sd = 1)
# Model 1: Y ~ X1
model1 <- lm(Y ~ X1)
cat("--- Model 1: Y ~ X1 ---\n")
## --- Model 1: Y ~ X1 ---
summary(model1)
##
## Call:
## lm(formula = Y ~ X1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4359 -1.1885 0.0037 1.1792 6.0775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.006848 0.017520 0.391 0.696
## X1 0.994802 0.017413 57.130 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.752 on 9998 degrees of freedom
## Multiple R-squared: 0.2461, Adjusted R-squared: 0.246
## F-statistic: 3264 on 1 and 9998 DF, p-value: < 2.2e-16
# Model 2: Y ~ X1 + X2
model2 <- lm(Y ~ X1 + X2)
cat("--- Model 2: Y ~ X1 + X2 ---\n")
## --- Model 2: Y ~ X1 + X2 ---
summary(model2)
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5865 -0.9641 -0.0003 0.9678 5.6430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.006103 0.014242 0.429 0.668
## X1 -1.022905 0.031520 -32.452 <2e-16 ***
## X2 2.025091 0.028266 71.643 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.424 on 9997 degrees of freedom
## Multiple R-squared: 0.5019, Adjusted R-squared: 0.5018
## F-statistic: 5036 on 2 and 9997 DF, p-value: < 2.2e-16
# Model 3: Y ~ X1 + X2 + X3
model3 <- lm(Y ~ X1 + X2 + X3)
cat("--- Model 3: Y ~ X1 + X2 + X3 ---\n")
## --- Model 3: Y ~ X1 + X2 + X3 ---
summary(model3)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6517 -0.6830 0.0128 0.6771 3.8700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.001927 0.010089 -0.191 0.849
## X1 2.962148 0.045803 64.672 <2e-16 ***
## X2 0.021449 0.028376 0.756 0.450
## X3 1.982808 0.019899 99.644 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.009 on 9996 degrees of freedom
## Multiple R-squared: 0.7501, Adjusted R-squared: 0.75
## F-statistic: 1e+04 on 3 and 9996 DF, p-value: < 2.2e-16
The example shows omitted variable bias (OVB) caused by confounding. The coefficient of X₁ changes significantly (from +0.953 to −1.091 to +3.121), as simpler models capture the effects of omitted variables X₂ and X₃, which are correlated with both X₁ and Y. OVB occurs when relevant variables influencing Y and correlated with predictors are omitted. When X₂ and X₃ are included, the regression estimates the effect of each predictor on Y, holding others constant, which removes bias and stabilizes X₁’s coefficient. This explains the sign changes and more accurate partial effect.
For causal inference, simple models produce biased estimates of X₁’s effect, while the more comprehensive model produces less biased estimates. For prediction, the expanded models perform better, indicated by a higher R² (0.128 to 0.4271) and lower Residual Standard Error, reflecting improved accuracy compared to simpler models. The changes in coefficients highlight the challenge of deriving causal relationships from observational data. In Randomized Controlled Trials (RCTs), randomization ensures predictor variables are uncorrelated with omitted variables, preventing severe OVB. Therefore, the coefficient for X₁ remains stable regardless of whether X₂ and X₃ are included.