Goodness of Fit, Confounders, and Power - a Revisit

This is a short vignette that reviews some issues associated with confounders, \(R^2\), statistical power, and model building.

In this first chunk, I simulate some data on Oreo(™) consumption and a health outcome. We are going to assume that the health outcome is scaled such that higher values are “bad.” Our research question is whether eating Oreos is bad for you.

You don’t need to follow this data generation unless you want to. The bottom line is that Oreo eating is correlated with a higher (worse) health outcome, but Oreo eating is itself determined substantially by a genetic trait.

That same genetic trait is also associated with a higher health outcome.

So, genetics is a confounder. It is correlated with both the outcome and the predictor (treatment, if you will) of interest.

Now, there is another variable here called “weight.” In this world, Oreos and weight are not correlated with one another (although I admit that is a bit of a stretch). But, weight is very much associated with the outcome. So, weight is not a confounder for the Oreo → Outcome relationship, but it is itself a significant predictor of the outcome.
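
If it helps to see that causal structure written down, here is one way to encode it as a DAG and ask which variables need to be adjusted for. This is just a sketch, assuming you have the dagitty package installed; it isn’t used anywhere else in this vignette.

library(dagitty)

# Encode the assumed causal structure of the simulation
dag <- dagitty("dag {
  genetics -> oreos
  genetics -> outcome
  oreos -> outcome
  weight -> outcome
}")

# Which variables must be adjusted for to identify the oreos -> outcome effect?
# This should return { genetics }; weight is not required.
adjustmentSets(dag, exposure = "oreos", outcome = "outcome")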

Disclaimer: I wrote this simulation with the help of Google Gemini. It was fun! It gave me several wrong versions of the simulation, and I taught it to do better. Then, once it was close, I went in and fixed a few things.

Here’s the data generation.

# Simulate data
set.seed(123)  # Set seed for reproducibility

# Number of individuals
n <- 1000

# Exposure (oreos)
oreos <- rbinom(n, 1, 0.3)  # 0 (non-oreo-eater), 1 (oreo-eater)

# Confounder (genetics) More likely for oreo_eaters

genetics <- rep(NA,length(oreos))

for(i in 1:length(oreos)){
  if(oreos[i]==1){genetics[i]<-rbinom(1,1,0.9)}
  if(oreos[i]==0){genetics[i]<-rbinom(1,1,0.1)}
}

# Another variable that also has an effect on health, let's call it weight
# this is uncorrelated with oreos or genetics
weight <- runif(n, 10, 30)

# True effect of oreos (weak effect)
true_effect_oreos <- 1

# True effect of genetics (strong effect)
true_effect_genetics <- 15

# Outcome (Health measure) - function of several things
outcome <- rnorm(n, 
                 mean = 80 + ## 80 is just the baseline mean, it's arbitrary
                   genetics + 
                   oreos * true_effect_oreos + 
                   genetics * true_effect_genetics + 
                   weight, 
                 sd = 10)

# make a data frame

dat<-data.frame(oreos,
                outcome, 
                weight, 
                genetics)

Let’s prove to ourselves there are some correlations here.

Outcome and Oreos are correlated.

cor(dat$outcome,dat$oreos)
[1] 0.4517602

Outcome and weight are correlated.

cor(dat$outcome,dat$weight)
[1] 0.4602018

Outcome and genetics are correlated.

cor(dat$outcome,dat$genetics)
[1] 0.5708168

Genetics and Oreos are correlated.

cor(dat$genetics,dat$oreos)
[1] 0.8051312

Weight and genetics are not correlated.

cor(dat$weight,dat$genetics)
[1] 0.04960281

Oreos and weight are not correlated.

cor(dat$oreos,dat$weight)
[1] 0.04041539
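
Equivalently, you can get all of the pairwise correlations at once by calling cor() on the whole data frame; the values will match the pairwise calls above.

# All pairwise correlations in one table
round(cor(dat), 2)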

Now, we will build some models and look at the \(R^2\) values.

First, just the bivariate relationship between the outcome and Oreos.

The coefficient on Oreos is highly statistically significant, but the \(R^2\) is low-ish.

lm_unadjusted <- lm(outcome ~ oreos, data = dat)
summary(lm_unadjusted)

Call:
lm(formula = outcome ~ oreos, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.462  -8.595  -0.480   8.085  43.822 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 101.4049     0.4737   214.1   <2e-16 ***
oreos        13.9511     0.8721    16.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.58 on 998 degrees of freedom
Multiple R-squared:  0.2041,    Adjusted R-squared:  0.2033 
F-statistic: 255.9 on 1 and 998 DF,  p-value: < 2.2e-16

Let’s add weight, which we know is correlated with the outcome but not with Oreos.

A higher \(R^2\), but we are still not accounting for the confounder, so Oreos remains significant.

lm_unnecessary_control <- lm(outcome ~ oreos+weight, data = dat)
summary(lm_unnecessary_control)

Call:
lm(formula = outcome ~ oreos + weight, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.388  -7.157  -0.429   7.140  39.942 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 80.25166    1.24358   64.53   <2e-16 ***
oreos       13.39862    0.75838   17.67   <2e-16 ***
weight       1.06730    0.05921   18.03   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.93 on 997 degrees of freedom
Multiple R-squared:  0.3997,    Adjusted R-squared:  0.3985 
F-statistic: 331.9 on 2 and 997 DF,  p-value: < 2.2e-16

Now, let’s do a model adding ONLY the confounder, and leaving out weight.

Now, Oreos is no longer statistically significant, even though the \(R^2\) is lower than in the previous model.

lm_confounder <- lm(outcome ~ genetics + oreos, data = dat)
summary(lm_confounder)

Call:
lm(formula = outcome ~ genetics + oreos, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.100  -7.967  -0.003   7.528  36.243 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  99.9043     0.4502 221.913   <2e-16 ***
genetics     17.6328     1.3130  13.429   <2e-16 ***
oreos        -0.6867     1.3538  -0.507    0.612    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.58 on 997 degrees of freedom
Multiple R-squared:  0.326, Adjusted R-squared:  0.3247 
F-statistic: 241.1 on 2 and 997 DF,  p-value: < 2.2e-16

Does that mean the model with weight but not genetics was “better”? Well, in a sense, yes. It would have lower prediction error.

sd(resid(lm_unnecessary_control))
[1] 10.91692
sd(resid(lm_confounder))
[1] 11.56783
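
If prediction really were the goal, you could also compare the fitted models with a criterion like AIC (a quick sketch; lower is better). Note that this speaks only to fit, not to whether the Oreos coefficient is trustworthy.

# Compare the models on penalized in-sample fit (lower AIC is better)
AIC(lm_unadjusted, lm_unnecessary_control, lm_confounder)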

But, if you were doing this for statistical inference, it’s clear that the better model is the one that includes the “true” confounder, not one that just adds a bunch of noise to the right-hand side.

Now, the “most informative” model might be one that includes Oreos, the confounder, and the other covariate. But, from an inferential standpoint for the variable of interest (Oreos), this isn’t really an improvement. All you did was buy a higher \(R^2\).

lm_bigger <- lm(outcome ~ genetics + oreos + weight, data = dat)
summary(lm_bigger)

Call:
lm(formula = outcome ~ genetics + oreos + weight, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.878  -6.550  -0.232   6.702  32.616 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 79.26819    1.12254  70.615   <2e-16 ***
genetics    17.00376    1.11712  15.221   <2e-16 ***
oreos       -0.70491    1.15135  -0.612    0.541    
weight       1.04391    0.05338  19.556   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.848 on 996 degrees of freedom
Multiple R-squared:  0.513, Adjusted R-squared:  0.5115 
F-statistic: 349.7 on 3 and 996 DF,  p-value: < 2.2e-16

Now, just for thoroughness, let’s create some more variables that are associated with the outcome.

dat$noise1<-dat$outcome+rnorm(1000,0,10)
dat$noise2<-dat$outcome+rnorm(1000,0,10)
dat$noise3<-dat$outcome+rnorm(1000,0,10)

Now, let’s imagine a “garbage can” regression. This model has a “great” \(R^2\) (not surprising, since the noise variables are just noisy copies of the outcome), but it is still missing the confounder, so we cannot trust the estimate for Oreos.

lm_biggest <- lm(outcome ~ oreos + weight + noise1 + noise2 + noise3, data = dat)
summary(lm_biggest)

Call:
lm(formula = outcome ~ oreos + weight + noise1 + noise2 + noise3, 
    data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.1985  -3.4831  -0.0143   3.2393  16.1702 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 18.08565    1.17753  15.359   <2e-16 ***
oreos        3.89116    0.38456  10.118   <2e-16 ***
weight       0.26350    0.03045   8.654   <2e-16 ***
noise1       0.24644    0.01396  17.656   <2e-16 ***
noise2       0.24548    0.01406  17.457   <2e-16 ***
noise3       0.27641    0.01379  20.043   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.056 on 994 degrees of freedom
Multiple R-squared:  0.8719,    Adjusted R-squared:  0.8713 
F-statistic:  1353 on 5 and 994 DF,  p-value: < 2.2e-16

Once that confounder is in there, game over.

lm_last <- lm(outcome ~ oreos + genetics + weight + noise1 + noise2 + noise3, data = dat)
summary(lm_last)

Call:
lm(formula = outcome ~ oreos + genetics + weight + noise1 + noise2 + 
    noise3, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.8442  -3.3004  -0.0287   3.3070  17.1606 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 20.84950    1.20798  17.260  < 2e-16 ***
oreos        0.67761    0.57701   1.174    0.241    
genetics     4.42890    0.60468   7.324 4.96e-13 ***
weight       0.29636    0.03001   9.875  < 2e-16 ***
noise1       0.23686    0.01367  17.333  < 2e-16 ***
noise2       0.23314    0.01381  16.886  < 2e-16 ***
noise3       0.26103    0.01360  19.190  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.927 on 993 degrees of freedom
Multiple R-squared:  0.8785,    Adjusted R-squared:  0.8777 
F-statistic:  1196 on 6 and 993 DF,  p-value: < 2.2e-16

You may ask, was this a good model?

It seems decent (the residual plot below looks fine). Let’s publish!

plot(fitted(lm_last),resid(lm_last))
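
For a fuller set of the usual diagnostics, you could also call plot() on the fitted model object (a quick sketch). Keep in mind that none of these plots can flag an omitted confounder; they only describe how well the model fits the variables you chose to include.

# The four standard lm diagnostic plots:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(lm_last)
par(mfrow = c(1, 1))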

Another thought - statistical significance

One odd thing about this setup is that we “know” there is a “real” effect of Oreos. We designed the data this way. So, is the problem just sample size? If we had a huge sample, could we eventually “find” the “effect” of Oreos?

Why, yes we could.

Here, I am going to generate a huge sample of 100,000 observations. Everything else is the same as before.

set.seed(123)  # Set seed for reproducibility

# Number of individuals --- MUCH BIGGER
n <- 100000

# Exposure (oreos)
oreos <- rbinom(n, 1, 0.3)  # 0 (non-oreo-eater), 1 (oreo-eater)

# Confounder (genetics) More likely for oreo_eaters

genetics <- rep(NA,length(oreos))

for(i in 1:length(oreos)){
  if(oreos[i]==1){genetics[i]<-rbinom(1,1,0.9)}
  if(oreos[i]==0){genetics[i]<-rbinom(1,1,0.1)}
}

# Another variable that also has an effect on health, let's call it weight
# this is uncorrelated with oreos or genetics
weight <- runif(n, 10, 30)

# True effect of oreos (weak effect)
true_effect_oreos <- 1

# True effect of genetics (strong effect)
true_effect_genetics <- 15

# Outcome (Health measure) - function of several things
outcome <- rnorm(n, 
                 mean = 80 + 
                   genetics + 
                   oreos * true_effect_oreos + 
                   genetics * true_effect_genetics + 
                   weight, 
                 sd = 10)

# make a data frame

dat<-data.frame(oreos,
                outcome, 
                weight, 
                genetics)

Here’s the model:

lm_bigger <- lm(outcome ~ genetics + oreos + weight, data = dat)
summary(lm_bigger)

Call:
lm(formula = outcome ~ genetics + oreos + weight, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.851  -6.734  -0.001   6.731  41.859 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 80.14695    0.11659 687.440  < 2e-16 ***
genetics    16.13196    0.10576 152.539  < 2e-16 ***
oreos        0.86421    0.10949   7.893 2.98e-15 ***
weight       0.99495    0.00543 183.223  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.04 on 99996 degrees of freedom
Multiple R-squared:  0.4913,    Adjusted R-squared:  0.4913 
F-statistic: 3.219e+04 on 3 and 99996 DF,  p-value: < 2.2e-16

Oh my, the Oreos effect is back!

Let’s sample 100 observations.

set.seed(5678)
library(tidyverse)
sub_sample<-sample_n(dat,100)

And model again.

lm_subsample <- lm(outcome ~ genetics + oreos + weight, data = sub_sample)
summary(lm_subsample)

Call:
lm(formula = outcome ~ genetics + oreos + weight, data = sub_sample)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.7173  -5.8011  -0.6805   5.5298  27.0875 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  79.5654     3.0002  26.520  < 2e-16 ***
genetics     15.3885     3.2885   4.679 9.43e-06 ***
oreos         5.0093     3.4258   1.462    0.147    
weight        1.0023     0.1466   6.839 7.42e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.779 on 96 degrees of freedom
Multiple R-squared:  0.621, Adjusted R-squared:  0.6092 
F-statistic: 52.44 on 3 and 96 DF,  p-value: < 2.2e-16

Oh no, the effect is gone again!

How much should we trust regression analyses at all, given that we can make effects “disappear” and “reappear” so easily?

One wonders.

Incidentally, this is a good illustration of why, when we are doing statistical inference, we only say that we “fail to reject” the null hypothesis if we don’t see statistical significance in a regression model. The relationship is “there” in these data, but we need a lot of data (or good luck in our sampling) to detect it. So, not having “stars” in a model doesn’t mean there’s no relationship. That’s possible, but it may just be that we don’t have the statistical power we need to pick up something subtle. Or, we might simply have gotten a bad draw.
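
If you want to make that power point concrete, you can estimate power directly by simulation: regenerate the data many times at a given sample size, fit the correctly specified model, and count how often the Oreos coefficient comes out significant. Below is a minimal sketch using the same data-generating process as above (slightly compacted); the function name and replication count are my own choices. With these settings, the estimated power should come out low at n = 1,000 and essentially 1 at n = 100,000.

# Simulate one dataset of size n and return the p-value on the oreos coefficient
sim_oreo_pvalue <- function(n) {
  oreos <- rbinom(n, 1, 0.3)
  genetics <- rbinom(n, 1, ifelse(oreos == 1, 0.9, 0.1))
  weight <- runif(n, 10, 30)
  outcome <- rnorm(n, mean = 80 + genetics + oreos * 1 + genetics * 15 + weight, sd = 10)
  fit <- lm(outcome ~ genetics + oreos + weight)
  summary(fit)$coefficients["oreos", "Pr(>|t|)"]
}

set.seed(42)
# Estimated power = share of simulated datasets where the oreos p-value is < 0.05
mean(replicate(200, sim_oreo_pvalue(1000)) < 0.05)
mean(replicate(200, sim_oreo_pvalue(100000)) < 0.05)  # this one takes a little while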