# Simulate data
set.seed(123) # Set seed for reproducibility
# Number of individuals
<- 1000
n
# Exposure (oreos)
<- rbinom(n, 1, 0.3) # 0 (non-oreo-eater), 1 (oreo-eater)
oreos
# Confounder (genetics) More likely for oreo_eaters
<- rep(NA,length(oreos))
genetics
for(i in 1:length(oreos)){
if(oreos[i]==1){genetics[i]<-rbinom(1,1,0.9)}
if(oreos[i]==0){genetics[i]<-rbinom(1,1,0.1)}
}
# Another variable that also has an effect on health, let's call it weight
# this is uncorrelated with oreos or weight
<- runif(1000,10,30)
weight
# True effect of oreos (weak effect)
<- 1
true_effect_oreos
# True effect of genetics (strong effect)
<- 15
true_effect_genetics
# Outcome (Health measure) - function of several things
<- rnorm(n,
outcome mean = 80 + ## 80 is just the baseline mean, it's arbitrary
+
genetics * true_effect_oreos +
oreos * true_effect_genetics +
genetics
weight, sd = 10)
# make a data frame
<-data.frame(oreos,
dat
outcome,
weight, genetics)
Goodness of Fit, Confounders, and Power - a Revisit
This is a short vignette that reviews some issues associated with confounders, \(R^2\), statistical power, and model building.
In this first chunk, I am simulating some data. I am simulating data on Oreo(™) consumption and a health outcome. We are going to assume that the health outcome is scaled such that higher values are “bad.” Our research question is whether eating Oreos is bad for you.
You don’t need to follow this data generation, unless you want to. The bottom line is that Oreo eating is correlated with the higher health outcome. But, Oreo eating is determined substantially by some genetic trait.
That same genetic trait also is associated with the higher health outcome.
So, genetics is a confounder. It is correlated with both the outcome and the predictor (treatment, if you will) of interest.
Now, there is another variable here called “weight.” In this world, Oreos and weight are not correlated with one another (although I admit that is a bit of a stretch). But, weight is very much associated with the outcome. So, weight is not a confounder for the Oreo –> Outcome relationship, but it is itself a significant predictor of the outcome.
Disclaimer: I wrote this simulation with the help of Google Gemini. It was fun! It gave a me several wrong versions of the simulation, and I taught it to do better. Then, once it was close, I went in and fixed a few things.
Here’s the data generation.
Let’s prove to ourselves there are some correlations here.
Outcome and Oreos are correlated.
cor(dat$outcome,dat$oreos)
[1] 0.4517602
Outcome and weight are correlated.
cor(dat$outcome,dat$weight)
[1] 0.4602018
Outcome and genetics are correlated.
cor(dat$outcome,dat$genetics)
[1] 0.5708168
Genetics and Oreos are correlated.
cor(dat$genetics,dat$oreos)
[1] 0.8051312
Weight and genetics are not correlated.
cor(dat$weight,dat$genetics)
[1] 0.04960281
Oreos and weight are not correlated.
cor(dat$oreos,dat$weight)
[1] 0.04041539
Now, we are will build some models and look at the \(R^2\) values.
First, just the bivariate relationship between the outcome and Oreos.
Highly statistically significant for Oreos, low-ish R2.
<- lm(outcome ~ oreos, data = dat)
lm_unadjusted summary(lm_unadjusted)
Call:
lm(formula = outcome ~ oreos, data = dat)
Residuals:
Min 1Q Median 3Q Max
-39.462 -8.595 -0.480 8.085 43.822
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 101.4049 0.4737 214.1 <2e-16 ***
oreos 13.9511 0.8721 16.0 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.58 on 998 degrees of freedom
Multiple R-squared: 0.2041, Adjusted R-squared: 0.2033
F-statistic: 255.9 on 1 and 998 DF, p-value: < 2.2e-16
Let’s add weight, which we know is correlated with the outcome but not with Oreos.
A higher R-squared, but not accounting for the confounder, so Oreos still significant.
<- lm(outcome ~ oreos+weight, data = dat)
lm_unnecessary_control summary(lm_unnecessary_control)
Call:
lm(formula = outcome ~ oreos + weight, data = dat)
Residuals:
Min 1Q Median 3Q Max
-35.388 -7.157 -0.429 7.140 39.942
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 80.25166 1.24358 64.53 <2e-16 ***
oreos 13.39862 0.75838 17.67 <2e-16 ***
weight 1.06730 0.05921 18.03 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.93 on 997 degrees of freedom
Multiple R-squared: 0.3997, Adjusted R-squared: 0.3985
F-statistic: 331.9 on 2 and 997 DF, p-value: < 2.2e-16
Now, let’s do a model adding ONLY the confounder, and leaving out weight.
Now, Oreos is no longer statistically significant, but the R-Squared is lower than the previous model.
<- lm(outcome ~ genetics + oreos, data = dat)
lm_confounder summary(lm_confounder)
Call:
lm(formula = outcome ~ genetics + oreos, data = dat)
Residuals:
Min 1Q Median 3Q Max
-34.100 -7.967 -0.003 7.528 36.243
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.9043 0.4502 221.913 <2e-16 ***
genetics 17.6328 1.3130 13.429 <2e-16 ***
oreos -0.6867 1.3538 -0.507 0.612
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.58 on 997 degrees of freedom
Multiple R-squared: 0.326, Adjusted R-squared: 0.3247
F-statistic: 241.1 on 2 and 997 DF, p-value: < 2.2e-16
Does that mean the model with weight but not genetics was “better”? Well, in a sense, yes. It would have lower prediction error.
sd(resid(lm_unnecessary_control))
[1] 10.91692
sd(resid(lm_confounder))
[1] 11.56783
But, if you were doing this for statistical inference, it’s clear that a better model is the one that includes the “true” confounder, and not a model that just adds a bunch of noise on the right hand side.
Now, the “most informative” model might be one that includes Oreos, the confounder, and the other covariate. But, from an inferential stand point for the variable of interest (Oreos), this isn’t really an improvement. All you did is buy a higher \(R^2\).
<- lm(outcome ~ genetics + oreos + weight, data = dat)
lm_bigger summary(lm_bigger)
Call:
lm(formula = outcome ~ genetics + oreos + weight, data = dat)
Residuals:
Min 1Q Median 3Q Max
-30.878 -6.550 -0.232 6.702 32.616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.26819 1.12254 70.615 <2e-16 ***
genetics 17.00376 1.11712 15.221 <2e-16 ***
oreos -0.70491 1.15135 -0.612 0.541
weight 1.04391 0.05338 19.556 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.848 on 996 degrees of freedom
Multiple R-squared: 0.513, Adjusted R-squared: 0.5115
F-statistic: 349.7 on 3 and 996 DF, p-value: < 2.2e-16
Now, just for thoroughness, let’s create some more variables that are associated the outcome.
$noise1<-dat$outcome+rnorm(1000,0,10)
dat$noise2<-dat$outcome+rnorm(1000,0,10)
dat$noise3<-dat$outcome+rnorm(1000,0,10) dat
Now, let’s imagine a “garbage can” regression. This model has “great” \(R^2\), but it’s still missing the confounder, so we cannot trust the estimate for the oreos.
<- lm(outcome ~ oreos + weight+noise1+noise2+noise3, data = dat)
lm_biggest summary(lm_biggest)
Call:
lm(formula = outcome ~ oreos + weight + noise1 + noise2 + noise3,
data = dat)
Residuals:
Min 1Q Median 3Q Max
-15.1985 -3.4831 -0.0143 3.2393 16.1702
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.08565 1.17753 15.359 <2e-16 ***
oreos 3.89116 0.38456 10.118 <2e-16 ***
weight 0.26350 0.03045 8.654 <2e-16 ***
noise1 0.24644 0.01396 17.656 <2e-16 ***
noise2 0.24548 0.01406 17.457 <2e-16 ***
noise3 0.27641 0.01379 20.043 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.056 on 994 degrees of freedom
Multiple R-squared: 0.8719, Adjusted R-squared: 0.8713
F-statistic: 1353 on 5 and 994 DF, p-value: < 2.2e-16
Once that confounder is in there, game over.
<- lm(outcome ~ oreos + genetics+weight+noise1+noise2+noise3, data = dat)
lm_last summary(lm_last)
Call:
lm(formula = outcome ~ oreos + genetics + weight + noise1 + noise2 +
noise3, data = dat)
Residuals:
Min 1Q Median 3Q Max
-14.8442 -3.3004 -0.0287 3.3070 17.1606
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.84950 1.20798 17.260 < 2e-16 ***
oreos 0.67761 0.57701 1.174 0.241
genetics 4.42890 0.60468 7.324 4.96e-13 ***
weight 0.29636 0.03001 9.875 < 2e-16 ***
noise1 0.23686 0.01367 17.333 < 2e-16 ***
noise2 0.23314 0.01381 16.886 < 2e-16 ***
noise3 0.26103 0.01360 19.190 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.927 on 993 degrees of freedom
Multiple R-squared: 0.8785, Adjusted R-squared: 0.8777
F-statistic: 1196 on 6 and 993 DF, p-value: < 2.2e-16
You may ask, was this a good model?
It’s seems decent. Let’s publish!
plot(fitted(lm_last),resid(lm_last))
Another thought - statistical significance
One odd thing about this set up is that we “know” there is a “real” effect of Oreos. We designed the data this way. So, is the problem just sample size? If we had a huge sample, could we eventually “find” the “effect” of Oreos?
Why, yes we could.
Here, I am going to generate a huge sample, of 100,000 observations. Everything else is the same as before.
set.seed(123) # Set seed for reproducibility
# Number of individuals --- MUCH BIGGER
<- 100000
n
# Exposure (oreos)
<- rbinom(n, 1, 0.3) # 0 (non-oreo-eater), 1 (oreo-eater)
oreos
# Confounder (genetics) More likely for oreo_eaters
<- rep(NA,length(oreos))
genetics
for(i in 1:length(oreos)){
if(oreos[i]==1){genetics[i]<-rbinom(1,1,0.9)}
if(oreos[i]==0){genetics[i]<-rbinom(1,1,0.1)}
}
# Another variable that also has an effect on health, let's call it weight
# this is uncorrelated with oreos or weight
<- runif(1000,10,30)
weight
# True effect of oreos (weak effect)
<- 1
true_effect_oreos
# True effect of genetics (strong effect)
<- 15
true_effect_genetics
# Outcome (Health measure) - function of several things
<- rnorm(n,
outcome mean = 80 +
+
genetics * true_effect_oreos +
oreos * true_effect_genetics +
genetics
weight, sd = 10)
# make a data frame
<-data.frame(oreos,
dat
outcome,
weight, genetics)
Here’s the model:
<- lm(outcome ~ genetics + oreos + weight, data = dat)
lm_bigger summary(lm_bigger)
Call:
lm(formula = outcome ~ genetics + oreos + weight, data = dat)
Residuals:
Min 1Q Median 3Q Max
-43.851 -6.734 -0.001 6.731 41.859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 80.14695 0.11659 687.440 < 2e-16 ***
genetics 16.13196 0.10576 152.539 < 2e-16 ***
oreos 0.86421 0.10949 7.893 2.98e-15 ***
weight 0.99495 0.00543 183.223 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.04 on 99996 degrees of freedom
Multiple R-squared: 0.4913, Adjusted R-squared: 0.4913
F-statistic: 3.219e+04 on 3 and 99996 DF, p-value: < 2.2e-16
Oh my, the Oreos effect is back!
Let’s sample 100 observations.
set.seed(5678)
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
<-sample_n(dat,100) sub_sample
And model again.
<- lm(outcome ~ genetics + oreos + weight, data = sub_sample)
lm_bigger summary(lm_bigger)
Call:
lm(formula = outcome ~ genetics + oreos + weight, data = sub_sample)
Residuals:
Min 1Q Median 3Q Max
-19.7173 -5.8011 -0.6805 5.5298 27.0875
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.5654 3.0002 26.520 < 2e-16 ***
genetics 15.3885 3.2885 4.679 9.43e-06 ***
oreos 5.0093 3.4258 1.462 0.147
weight 1.0023 0.1466 6.839 7.42e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.779 on 96 degrees of freedom
Multiple R-squared: 0.621, Adjusted R-squared: 0.6092
F-statistic: 52.44 on 3 and 96 DF, p-value: < 2.2e-16
Oh no, the effect is gone again!
How much should we trust regression analyses at all, given that we can make effects “disappear” and “reappear” so easily?
One wonders.
Incidentally, this is a good illustration of why, when are doing statistical inference, we only say that we “fail to reject” the null hypothesis if we don’t see statistical significance in a regression model. The relationship is “there” in these data, but we need a lot of data (or good luck in our sampling) to detect it. So, not having “stars” in a model doesn’t mean there’s no relationship - although that’s possible, it may just be that we don’t have the statistical power we need to pick up something subtle. Or, we might have just got a bad draw.