library(dplyr)   # for tibble() and sample_n(), used below
set.seed(1234)
outcome <- rnorm(10000,5,1)
Module 7 - Experiments
Why Experiments?
Regression is a useful tool for prediction and inference, but it has a limitation.
Regression cannot account for omitted variable bias. There are almost always going to be lurking confounders.
With a well-run, genuinely randomized experiment, there are no confounders.
Experiments are fun! But hard, for reasons we’ll discuss.
Breastfeeding
In the developed world, there is a strong, positive correlation between breastfeeding and the health of a baby.
In regression format, this would look like:
\[ BabyHealth = b_0 +b_1 * Breastfed + \epsilon\]
And the coefficient \(b_1\) would be positive and statistically significant.
Is that regression reliable?
You should know at this point that this regression is very weak evidence that breastfeeding improves babies’ health.
Included in the \(\epsilon\) error term are many omitted confounders.
An obvious one would be education, which is correlated with both baby health and breastfeeding.
Breastfeeding is commonly thought to be better, so educated people follow that advice.
Educated people can provide better material support for babies in a variety of ways.
Omitted Variable Bias
If you estimate the model with and without controlling for education, your estimated coefficient on breastfeeding could (1) differ in direction or magnitude and/or (2) have different standard errors (and thus be statistically significant or not).
\(BabyHealth = b_0 +b_{1} * Breastfed + \epsilon\)
\(BabyHealth = b_0 +b_{1} * Breastfed + b_2 * education + \epsilon\)
- But of course, you can’t really control for every reasonable confounder, and some confounders won’t be in your data or aren’t observable.
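A minimal simulated sketch (all numbers invented) shows the mechanics: here the true effect of breastfeeding is zero, but omitting education makes it look beneficial.

### Sketch: education drives both breastfeeding and baby health
set.seed(42)
education   <- rnorm(1000)                        # standardized confounder
breastfed   <- rbinom(1000, 1, plogis(education)) # educated mothers breastfeed more
baby_health <- 0.5*education + rnorm(1000)        # true breastfeeding effect is zero
coef(lm(baby_health ~ breastfed))["breastfed"]             # biased upward
coef(lm(baby_health ~ breastfed + education))["breastfed"] # near zero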
Reverse Causation
Additionally, a healthy baby might be more likely to latch, or the healthy conditions in which a baby lives might be associated with a mother believing breastfeeding is important.
Results of Breastfeeding Experiment
In fact, some randomized experiments show minimal differences between breastfeeding and babies’ health.
Now, this is not evidence against breastfeeding for any individual person, but it might be informative about where to direct public health efforts.
Why do experiments work?
Experiments work because of the law of large numbers. On average, sample values (like means) will approach population values as sample size increases.
This is another way of saying that truly randomized sampling processes, with larger sample sizes, remove bias from estimation processes.
A mean of a sample of 1000 people is likely to be pretty close to the population mean.
So, the “average” value of a biasing variable (confounder) between a treatment and control group is likely to be the same, in expectation.
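A quick sketch with an invented pre-treatment covariate makes the point:

### Randomize treatment, then check balance on a hypothetical covariate
set.seed(42)
income  <- rnorm(1000, 50000, 10000)
treated <- sample(c(0,1), 1000, replace=TRUE)
mean(income[treated==1]) - mean(income[treated==0])  # small relative to income's scale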
Accessing Alternative Worlds
Here \(\bar{Y}_{0T}\) is the average untreated outcome of the treatment group (a counterfactual we never observe) and \(\bar{Y}_{0U}\) is the average untreated outcome of the control group. By extension, \(\bar{Y}_{0T}\) should be the same as \(\bar{Y}_{0U}\)
So \(\bar{Y}_{1T} - \bar{Y}_{0U}\) should be an unbiased estimate of the average treatment effect
There will be some noise from sampling: \(\bar{Y}_{1T} - \bar{Y}_{0U} = \text{ATE} + \text{noise}\)
But noise is measurable using standard errors, and we can construct CIs
We know how to estimate the SE of a difference in means
\(\sqrt{\frac{Var(\text{Treated})}{\text{N Treated}} + \frac{Var(\text{Untreated})}{\text{N Untreated}}}\)
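A minimal sketch with invented outcome vectors, computing the SE and an approximate 95% CI by hand:

### Hypothetical treated/control outcomes (names and numbers invented)
set.seed(1)
y_treat   <- rnorm(500, 5.1, 1)
y_control <- rnorm(500, 5.0, 1)
se_diff <- sqrt(var(y_treat)/length(y_treat) + var(y_control)/length(y_control))
(mean(y_treat) - mean(y_control)) + c(-1.96, 1.96)*se_diff  # approximate 95% CI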
Blocking and Stratification
Of course, even with random assignment, you will sometimes get uneven “balance” on observables
Try randomly assigning a treatment to a group of 50 men and 50 women. Sometimes your treatment group is going to be more “male” than the control group.
You can address this through “blocking” or “stratification”, intentionally dividing your participants up and randomly assigning within “strata”
Make sure you randomly assign half the males to be treated and half the females to be treated.
This can get very complicated.
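In the simple two-stratum case, though, it only takes a few lines. A sketch assuming a hypothetical data frame people with a sex column:

library(dplyr)
people <- tibble(id = 1:100, sex = rep(c("M","F"), each = 50))
people <- people %>%
  group_by(sex) %>%
  mutate(treated = sample(rep(c(0,1), length.out = n()))) %>%  # exactly half treated
  ungroup()
table(people$sex, people$treated)  # 25/25 within each stratum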
Sample Size Considerations
Let’s go back to that SE formula again (or you could think about this as a bivariate regression with an indicator variable)
\(\sqrt{\frac{Var(\text{Treated})}{\text{N Treated}} + \frac{Var(\text{Untreated})}{\text{N Untreated}}}\)
If we want statistical precision, we want a small SE.
We get a small standard error by making the stuff under the square root sign smaller
So, we want small variance outcomes, and large numbers of treated and control.
Another problem - statistical power
Experiments tend to be small because they are expensive and administratively difficult.
This means that your samples could be small, and you might not have the statistical power you need to detect small effects.
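Base R’s power.t.test() makes this concrete. As a sketch, suppose the true effect is a .125-unit bump (the average of the runif(1,0,.25) bump used in the simulation below) against an outcome sd of 1:

power.t.test(delta = 0.125, sd = 1, sig.level = 0.05, power = 0.8)
### asks for roughly 1,000 subjects per group -- a 50-person study is hopeless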
The intuition behind statistical power
Let’s look at a very big population. Pre-treatment, the entire population fluctuates around a mean value on the outcome of interest.
Statistical power intuition
Now, let’s randomly assign treatment.
population <- tibble(outcome=outcome, treated=rep(NA,length(outcome)))
for(i in 1:dim(population)[1]){
  population$treated[i] <- sample(c(0,1),1)
}
table(population$treated)

   0    1 
4986 5014 
Assigning the treatment effect
Now, let’s give a very small “bump” to the treatment group.
population$effect <- NA
### this is very inefficient coding but I do it for clarity
for(i in 1:dim(population)[1]){
  if(population$treated[i]==1){
    population$effect[i] <- runif(1,0,.25)
  } else population$effect[i] <- 0
}
population$post <- population$outcome + population$effect
At the population level, treatment is significant
t_test_pop <- t.test(population[population$treated==1,]$post,
                     population[population$treated==0,]$post)
t_test_pop$p.value

[1] 3.935439e-10
In (almost) every sample, there is an “effect”, but it’s small
A t-test on a sample of only 50 is underpowered.
set.seed(5678)
a_sample <- sample_n(population,50,replace=FALSE)
#### This is not significant
t_test_sample <- t.test(a_sample[a_sample$treated==1,]$post,
                        a_sample[a_sample$treated==0,]$post)
t_test_sample$p.value

[1] 0.685492
If you get a bigger sample, you’ll find the effect again.
big_sample <- sample_n(population,5000,replace=FALSE)
t_test_big_samp <- t.test(big_sample[big_sample$treated==1,]$post,
                          big_sample[big_sample$treated==0,]$post)
t_test_big_samp$p.value

[1] 3.201221e-06
Administrative Problems with Experiments - Too many outcomes
Even with experiments, if you run enough t-tests, eventually you might find something statistically significant, just by chance.
Researchers will probably only report statistically significant results.
This is why it’s good to plan your projects carefully (as you are practicing this semester)
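A quick invented illustration: 20 t-tests on pure noise, with no true effects anywhere.

set.seed(99)
p_values <- replicate(20, t.test(rnorm(50), rnorm(50))$p.value)
sum(p_values < 0.05)                        # expect about 1 in 20 by chance
p.adjust(p_values, method = "bonferroni")   # one standard multiple-testing correction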
Administrative Problems with Experiments - Imbalance
One issue with experiments, you might intuit, is that despite our best efforts, the treatment and control groups may still not be statistically identical.
Income distribution between treatment and control may not look the same, for instance.
What can I do about imbalance?
Depending on context, you might toss the results, but this might not be necessary.
You can instead control for observables using regression, as sketched below, but this adjustment can itself introduce bias.
There’s really no solution for this, other than transparency.
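For reference, the regression adjustment is a one-line change. A sketch with invented data where income ends up imbalanced and the true treatment effect is zero:

set.seed(3)
dat <- data.frame(treated = rep(c(0,1), each = 50))
dat$income <- rnorm(100, 50 + 5*dat$treated, 10)  # imbalanced by bad luck
dat$post   <- 0.1*dat$income + rnorm(100)         # true treatment effect is zero
coef(lm(post ~ treated, data = dat))["treated"]           # picks up the imbalance
coef(lm(post ~ treated + income, data = dat))["treated"]  # adjusted, near zero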
Attrition
Related to statistical power and sample size and covariate imbalance is the issue of attrition.
Some people will leave the study. This decreases sample size and may introduce systematic differences between treatment and control groups.
If you are in the “control” for a weight loss study, and you aren’t losing weight, you may just drop out of the study.
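A sketch with invented numbers of how that kind of differential attrition biases the estimate downward:

set.seed(7)
treated     <- rep(c(0,1), each = 500)
weight_loss <- rnorm(1000, mean = 2*treated)  # true effect = 2
### control subjects who aren't losing weight drop out half the time
stays <- ifelse(treated==0 & weight_loss < 0, runif(1000) < 0.5, TRUE)
mean(weight_loss[treated==1 & stays]) - mean(weight_loss[treated==0 & stays])
### noticeably less than the true effect of 2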
Interference
A big assumption of experiments is what experimentalists call the “stable unit treatment value assumption” (SUTVA).
If the treatment status of one participant has a “spillover” effect on another, untreated participant, then your comparison of treated and untreated is problematic.
If I start a substance abuse prevention program, randomly assign to person A but not person B, but person A’s use or non-use influences person B, then I no longer have a good treated/untreated comparison between A and B.
Noncompliance
Finally, a problem with experiments is that some participants won’t play ball.
They won’t take the pill, or they will take the pill even if they aren’t supposed to.
The simplest solution to this? Just agree that what you are actually going to compare is the intent-to-treat effect (ITT) rather than the average treatment effect (ATE). This is usually more policy-relevant anyway.
What if you really want to retrieve the ATE from the ITT?
Sometimes, you really want to get the best possible estimate of the effect of the treatment, even when you know there is a problem with noncompliance.
In that case, you actually want to calculate the complier average treatment effect (CATE): the effect of the treatment among those who actually took the treatment when assigned to it and didn’t take it when assigned to control.
How would you get the CATE?
- Let’s say you know the proportion of compliers in the treatment group.
- Skipping some hoops and justification that you can read in the book: divide the ITT by the proportion of compliers, and you’ll get the effect of the treatment among the compliers.
CATE Example
The book has a straightforward example of moving from ITT to CATE, but here’s another.
Let’s say I run an experiment where I try to motivate citizens to talk to a neighbor of a different racial group, and I want to see if this inter-group contact changes attitudes about diversity in America.
I might find that citizens who received my motivating letter have a 1.5 unit increase in positive feelings towards diversity.
But let’s say I had hidden cameras and I knew that only 60% (.6) of the treated group actually had a conversation.
So, I divide 1.5/0.6, and my CATE is actually 2.5 –> considerably higher than the ITT.
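In R, the arithmetic is one line:

itt        <- 1.5   # intent-to-treat estimate
compliance <- 0.6   # proportion of the treated group that complied
itt / compliance    # CATE = 2.5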
The Hunt for “Natural Experiments”
You might wonder why we discuss experiments at all, when experiments are often costly, impossible, or unethical.
It’s because the experiment is the “starting point” for causal inference.
Imbalance, sample size, interference, attrition, and noncompliance are all just as much considerations in observational studies.
And so, as you think about causal inference using observational data, you can hunt for “natural experiments” where treatments have been assigned (perhaps randomly), and compare the “treated” and “untreated.”