Simulate data

library(dplyr)   #for %>% and arrange(), if not already loaded
library(ggplot2) #for the plots below, if not already loaded

W <- rbinom(n=1000, size=1, prob=0.5)
X <- rnorm(1000) + 2*W
Y <- X + W + rnorm(1000)

d <- data.frame(Y,X,W) %>% arrange(X)
d$W <- factor(d$W)
head(d)
##           Y         X W
## 1 -3.547570 -3.334675 0
## 2 -3.472954 -2.778326 0
## 3 -2.910878 -2.651398 0
## 4 -2.068731 -2.560052 0
## 5 -2.554195 -2.555627 0
## 6 -2.545519 -2.546475 0

Don’t worry about the R code here and in the rest of the document. What we’ve done is simulate a binary W, and a continuous X whose mean is on average 2 higher when W=1 than when W=0. We then simulate Y where the mean difference per unit of X is 1, and the mean difference between W=1 and W=0 is also 1. So W is associated with both X and Y, making it a confounder.
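To see the confounding structure concretely, here is a quick check (a sketch; the exact values will differ with each run since no random seed is set):

#Check the two confounding conditions: W is associated with X,
#and W is associated with Y
tapply(d$X, d$W, mean) #mean X should be roughly 2 higher when W=1
tapply(d$Y, d$W, mean) #mean Y should also be higher when W=1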

Fit bivariable and multivariable regressions

#Only include X in the crude model
res_crude <- lm(Y~X, data=d)

#Include X and W in the adjusted model
res_adjusted <- lm(Y~X+W, data=d)

Examine the fitted models

summary(res_crude)
## 
## Call:
## lm(formula = Y ~ X, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8839 -0.7018  0.0095  0.7104  3.0086 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.24349    0.04077   5.972 3.25e-09 ***
## X            1.24291    0.02312  53.767  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.038 on 998 degrees of freedom
## Multiple R-squared:  0.7434, Adjusted R-squared:  0.7431 
## F-statistic:  2891 on 1 and 998 DF,  p-value: < 2.2e-16
summary(res_adjusted)
## 
## Call:
## lm(formula = Y ~ X + W, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.11821 -0.64826  0.01675  0.65773  2.77881 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01765    0.04534  -0.389    0.697    
## X            1.00460    0.03086  32.557   <2e-16 ***
## W1           0.96088    0.08784  10.939   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9817 on 997 degrees of freedom
## Multiple R-squared:  0.7709, Adjusted R-squared:  0.7704 
## F-statistic:  1677 on 2 and 997 DF,  p-value: < 2.2e-16

Note that the coefficient on X when W is not included in the model is about 1.24, which is confounded. Because we simulated the data, we know the true mean difference in Y per unit increase in X is 1. The coefficient on X is correctly about 1 when we include W in the multivariable model.
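If you just want the two slopes side by side, a small sketch:

coef(res_crude)["X"]    #crude slope, ~1.24, biased upward by confounding
coef(res_adjusted)["X"] #adjusted slope, ~1, the truth we simulated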

Get predictions of Y from both models

d$predY_crude <- predict(res_crude, type="response")
d$predY_adj_strat <- predict(res_adjusted, type="response")

Don’t worry about the details of this. We are simply getting predicted Y for each X so we can graph the regression lines.
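If it helps, here is a sketch of what predict() does for the adjusted model: it evaluates the fitted equation at each row’s X and W, which is why the adjusted predictions trace two parallel lines.

b <- coef(res_adjusted)
manual_pred <- b["(Intercept)"] + b["X"]*d$X + b["W1"]*(d$W == "1")
all.equal(unname(manual_pred), unname(d$predY_adj_strat)) #should be TRUE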

Examine the slope of the relationship between X and Y, not adjusting for W

ggplot(d, aes(x=X, y=Y)) +
    geom_point(alpha=0.5) +
    geom_line(aes(y=predY_crude), size=1.5)

Here are the observations plotted with the exposure X on the X-axis and the outcome Y on the Y-axis. The regression line (not adjusting for W) is plotted, and notice that it fits well through the cloud of points. However, its slope is about 1.24, which is not the true relationship between X and Y: we simulated the data so Y increases (on average) by 1 per 1-unit increase in X.

Examine the slope of the relationship between X and Y, stratified by W

ggplot(d, aes(x=X, y=Y, group=W, color=W)) +
  geom_point(alpha=0.5) +
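  #tableau10 is a vector of colors assumed to be defined earlier in the document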
  scale_color_manual(values = tableau10) +
  geom_line(aes(y=predY_crude), color="black", size=1.5) +
  geom_line(aes(y=predY_adj_strat), size=1.5)

Here, we fit lines summarizing the association between X and Y stratified by W (blue is among those with W=0, orange among those with W=1). The black line is the crude, confounded association. Note the gap between the stratified lines is ~1, which is the mean difference in Y between W=1 and W=0 (and we know this because we simulated the data). Also note that the slope of these stratified lines is 1, not 1.24, because we have stratified by the confounder to remove the confounding. Finally, note the distribution of orange and blue points: W is a confounder because those with W=1 have both higher X and higher Y. Before stratification, this makes the association (slope) between X and Y steeper than the truth.

The stratified lines have the same slope of 1 (because we haven’t included an interaction term between X and W, the slope of X can’t change with the value of W), but that slope is different from the confounded estimate of 1.24. So the Beta_1 coefficient on X in the adjusted model is this stratified slope of 1, i.e., the estimated association after removing confounding by W through stratification.
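As an aside (a sketch, not part of the confounding adjustment above), if we did want the slope of X to vary by W, we would add an interaction term:

res_interaction <- lm(Y~X*W, data=d)
coef(res_interaction) #the X:W1 term is the difference in slopes, ~0 here because we simulated no interaction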

So note how including a confounder in a multivariable regression changes the estimated relationship between X and Y. And because we simulated these data and know the truth, we can see that including W removes the confounding. In this example, including W in the regression model is very similar to stratifying by W and fitting a separate regression line within each stratum (see the sketch below), which, conceptually, is the same as stratifying a 2x2 table by the confounder to remove confounding. However, in a regression context, we could include a continuous W, or many W variables. To use stratification instead of regression, we’d have to categorize any continuous variables, and stratifying by many variables is 1) complex, and 2) leads to sparse strata, so the regression approach is more appealing.
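To make the stratification analogy concrete, here is a sketch fitting a separate crude regression within each level of W and comparing the X slopes:

coef(lm(Y~X, data=subset(d, W == "0")))["X"] #slope among W=0, ~1
coef(lm(Y~X, data=subset(d, W == "1")))["X"] #slope among W=1, ~1
coef(res_adjusted)["X"]                      #adjusted slope, also ~1

The two stratified slopes will differ slightly from each other and from the adjusted slope, because the adjusted model pools the strata to estimate one common slope.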

An example of more extreme confounding that reverses the direction of association

How W acts as a confounder might be clearer in an extreme example. Here, W=0 is associated with a lower X but a much higher Y. Now, if you don’t stratify (in the plot; in reality, we’re including W in the regression model), you see a negative association between X and Y, even though the true association is positive.

W <- rbinom(n=1000, size=1, prob=0.5)
X <- rnorm(1000) + 2*W
Y <- X - 10*W + rnorm(1000)

d <- data.frame(Y,X,W) %>% arrange(X)
d$W <- factor(d$W)
res_crude <- lm(Y~X, data=d)
res_adjusted <- lm(Y~X+W, data=d)

d$predY_crude <- predict(res_crude, type="response")
d$predY_adj_strat <- predict(res_adjusted, type="response")

ggplot(d, aes(x=X, y=Y, group=W, color=W)) +
  geom_point(alpha=0.5) +
  scale_color_manual(values = tableau10) +
  geom_line(aes(y=predY_crude), color="black", size=1.5) +
  geom_line(aes(y=predY_adj_strat), size=1.5)
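And a quick numeric check of the reversal (values approximate, since the data are random):

coef(res_crude)["X"]    #crude slope: negative (roughly -1.5), sign-reversed by confounding
coef(res_adjusted)["X"] #adjusted slope: ~1, the positive truth we simulated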