---
title: 'Tutorial 1: Analysis and Design of Experiments'
date: "April 2022"
output:
  html_document:
    highlight: haddock
    theme: journal
    number_sections: no
    toc: yes
    toc_depth: 2
    toc_float: yes
  pdf_document:
    toc: yes
    toc_depth: '2'
---
```{r setup, echo = F}
knitr::opts_chunk$set(
  eval = TRUE,
  echo = TRUE,
  warning = FALSE,
  message = FALSE
)
```
In this tutorial, you will learn the basics about data analysis of an experiment. You will not only review how to evaluate the outcomes, but also how to make sure your experimental data is well-suited for analysis. Then, you will learn how to use power calculations as a planning tool for experiments with different types of outcomes, and how to select an outcome based on power analysis and the objectives of your experiment.
We start by loading the necessary packages.
```{r}
library(here)       # path to files
library(tidyverse)  # for summarizing and manipulating data
library(estimatr)   # for estimating difference in means
library(broom)      # for tidying test and model output
library(knitr)      # for HTML tables
library(kableExtra) # for styling HTML tables
library(cobalt)     # for covariate balance plots
library(ggthemes)   # for color blind palette
library(ggplot2)    # for creating graphs
library(stats)      # for t-tests and power calculations
```
For this tutorial, we will use the following dataset from a randomized experiment:
```{r read_data}
df = read_csv("mobilization_small.csv")
```
Here is a description of the experiment from the journal article:
...a large-scale field experiment in which individuals were randomly assigned to treatment and control groups. The field experiment was conducted in Iowa and Michigan before the 2002 midterm elections. The congressional districts of each state were divided into 'competitive' and 'uncompetitive' strata. Within each stratum, households containing one or two registered voters were randomly assigned to treatment and control groups. For two-person households, just one representative from each household was assigned to treatment or control; if there was another voter in the household, he or she was ignored for purposes of calling and statistical analysis. Only one type of treatment was used: get-out-the-vote (GOTV) phone calls.
Two national phone banks were hired to read a nonpartisan GOTV message to individuals in the treatment group. The script read as follows:
Hello, may I speak with (name of person) please? Hi. This is (caller's name) calling from Vote 2002, a nonpartisan effort working to encourage citizens to vote. We just wanted to remind you that elections are being held this Tuesday. The success of our democracy depends on whether we exercise our right to vote or not, so we hope you'll come out and vote this Tuesday. Can I count on you to vote next Tuesday?
Members in the control group were not called.
Respondents were coded as contacted if they listened to the script and replied to the question, "Can I count on you to vote next Tuesday?" regardless of whether they answered yes or no (or volunteered some other answer). After the election, public voting records on each individual were obtained, allowing us to assess whether the GOTV phone appeals stimulated voter turnout.
(Arceneaux, Gerber and Green (2006), pg. 40-41, emphasis added).
Our data is unlabeled, so to help understand some of the variables, here is a list of some key variables that we will work with:
Variable | Description
-------- | -------------
age | Age
persons | Number of registered voters in household
newreg | Newly registered voter
female | Indicator variable where $= 1$ indicates female
st_sen | Index of state senate district
st_hse | Index of state congressional district
vote00 | Indicator variable where $= 1$ indicates voted in the 2000 general election
vote98 | Indicator variable where $= 1$ indicates voted in the 1998 midterm election
county | County index
state | US state indicator variable where $= 0$ indicates Michigan and $= 1$ indicates Iowa
comp_mi | Indicator of whether the congressional district was classified as competitive ($= 2$) or uncompetitive ($= 1$) in Michigan. Observations not from Michigan have a value of $0$.
comp_ia | Indicator of whether the congressional district was classified as competitive ($= 2$) or uncompetitive ($= 1$) in Iowa. Observations not from Iowa have a value of $0$.
vote02 | Indicator variable where $= 1$ indicates voted in the 2002 midterm election
treat_pseudo | Indicator variable where $= 1$ indicates person was assigned to the treatment group
treat_real | Indicator variable where $= 1$ indicates person was in the actualized treatment group (i.e., among the people with a listed phone number)
contact | Indicator variable where $= 1$ indicates person was contacted (NA indicates an observation with an unlisted phone number)
As the description reads, the random assignment to the treatment groups was carried out within each stratum separately, so it will be important to take this into account, as we will see later. We will create a variable `strata` indicating which state (Iowa / Michigan) and district (Competitive / Not competitive) the households are in. We will identify the strata with the following names: `Iowa NC`, `Iowa C`, `Michigan NC`, and `Michigan C`.
```{r strata_def}
sample_df = df %>%
  # code a strata variable
  # mutate(.data) adds new variables and preserves existing ones
  mutate(
    strata = as.factor(
      # case_when is another way of doing multiple if_else statements
      case_when(
        state == 0 & comp_mi == 1 ~ "Michigan NC",
        state == 0 & comp_mi == 2 ~ "Michigan C",
        state == 1 & comp_ia == 1 ~ "Iowa NC",
        state == 1 & comp_ia == 2 ~ "Iowa C",
        TRUE ~ NA_character_  # if none of the above, set it to NA
      )
    ),
    # create dummy variables for competitive districts in Michigan and Iowa
    comp_mi_dummy = ifelse(comp_mi == 2, 1, 0),
    comp_ia_dummy = ifelse(comp_ia == 2, 1, 0)
  )
```
In an experimental setting, we can think of the data in terms of three types of variables: outcomes, treatments, and covariates.
Outcome variables measure features of the world we are interested in affecting with our interventions. `vote02` is an example of an outcome variable in our data. Outcome measures are arguably the most important part of an experiment, so it is extremely important to choose them carefully. That is the reason we dedicate a section exclusively to this matter at the end of this tutorial.
Treatment variables indicate which treatment arm each experimental unit is in. If the experiment has only two treatment groups (for example, in an A/B testing context), a binary variable is sufficient to represent the treatment assignment of the units. In our example dataset we have three different treatment variables: `treat_pseudo`, `treat_real`, and `contact`. The differences between them may seem subtle, but they are actually very important. Consider first `treat_pseudo` and `treat_real`. The first indicates whether households were assigned to the original treatment or control groups, while the second indicates whether households were assigned to the original treatment or control groups *and* had a listed phone number. Since the intervention is centered around phone calls, it makes sense to only consider those households for which a phone number is available. However, by doing so, we are reducing the size of our sample, and this is something important to take into consideration (we will delve into this matter later). Let us count the households in each group and put numbers to this reduction in sample size:
```{r treat_pseudo}
sample_df %>%
  # we want to know frequencies by treat_pseudo and treat_real, so
  # we group by these variables
  group_by(treat_pseudo, treat_real) %>%
  # count the number of observations by values of treat_pseudo and treat_real
  summarize(n_observations = n(), .groups = 'drop') %>%
  # format numbers
  kable(format.args = list(big.mark = ",")) %>%
  # additional styling options
  kable_styling(bootstrap_options = "striped")
```
We can see that 162,960 households from the control group and 7,444 households from the treatment group do not have their phone number listed (as represented by the NA entry). These numbers represent about 30% of the original control group and 22% of the original treatment group, which is a considerable reduction in the sample size.
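As a quick check, these shares can be computed directly. The sketch below assumes, as described above, that a missing `treat_real` entry corresponds to a household with an unlisted phone number.

```{r unlisted_share}
sample_df %>%
  group_by(treat_pseudo) %>%
  # share of households in each assignment group with an unlisted phone number
  summarize(share_unlisted = mean(is.na(treat_real)), .groups = "drop") %>%
  kable(digits = 2) %>%
  kable_styling(bootstrap_options = "striped")
```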
Sample size is not the only thing we should consider in cases like this. Of all the households that were called, only a fraction answered, as indicated by the variable `contact`. Which variable should we use as our treatment? It turns out that the choice depends on which question we want to answer. If we think of the intervention as calling households regardless of whether they pick up the call or not, then `treat_real` works well as our treatment. If we are interested in the effect of what is being said in the call, then `treat_real` does not address the fact that some calls might not be answered, and using `contact` might be a better option.^1

For this tutorial, we will stick to using `treat_real`.
```{r treat_real}
# remove rows for which treat_real is missing
filtered_df = filter(sample_df, !is.na(treat_real))
```
Optional: test your comprehension. In a similar way as we did for `treat_pseudo` and `treat_real`, create a table with `treat_real` and `contact` to understand how the two differ, and interpret those differences.

```{r Opt_1}
# Replace the NULL values in the function below by
# `treat_real` and `contact`, respectively

```
Lastly, we will look at how the treatment assignment was carried out across the different strata. The code below will show you the share of observations across each stratum that were assigned to treatment.
```{r treat_by_strata}
filtered_df %>%
  group_by(strata) %>%
  # we add the option "na.rm = TRUE" to bypass observations with missing entries
  summarize(treat_prop = mean(treat_real, na.rm = TRUE), .groups = 'drop') %>%
  kable(format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = "striped")
```
There are substantial differences among the strata in the share of households that have been assigned to treatment. This will be very important to take into account when we analyze the data, as we will see later.
The final type of variable in our data is the covariate. Covariates are independent variables that could affect our outcomes of interest, but are not of direct interest to our study. They are typically measured before interventions take place. As such, they are usually referred to as "pre-treatment" variables and thus cannot be affected by our interventions. Covariates typically describe characteristics of the experimental units such as, in our example dataset, `age`, `persons`, `newreg`, and `female`. While they might not be of direct concern in terms of the experiment, measuring covariates serves very important purposes, such as providing evidence that the randomization was correctly implemented (something we will do shortly), or studying the effects of interventions for different subgroups of our sample (something we will do in our next tutorial).
In this section, we will cover topics on the analysis of experiments. First, we will take a detour to review important statistical concepts from hypothesis testing methodology. Then, using that machinery, we will estimate treatment effects and conduct statistical inference about them.
Statistical hypothesis testing is a method that allows us to determine to what extent the data provides evidence in favor of (or against) a particular hypothesis. In this context, a hypothesis is a well-defined statement about some parameter (or parameters) of a particular population (or populations). Suppose that we pose the hypothesis that the average age of households in competitive Iowa that were assigned to treatment is equal to the average age of households in competitive Iowa that were assigned to control. This is a desirable thing, as we want both groups to be as similar as possible (to learn more about this, check the optional section "Balance Checks" further below). Letting the former be defined as $\mu_1$ and the latter as $\mu_0$, we can formally define our hypothesis as:
$$ \boxed{H_0: \mu_1 = \mu_0} $$

Here, $H_0$ is known as the null hypothesis, and it is the hypothesis that we will challenge, in favor of an alternative hypothesis, which, for example, can be:
$$ \boxed{H_A: \mu_1 \neq \mu_0} $$
This is known as a "two-sided" alternative hypothesis, as it will be true if $\mu_1 > \mu_0$ or if $\mu_1 < \mu_0$. If we had some prior knowledge about the nature of the parameters, we might benefit from posing a "one-sided" alternative hypothesis instead (for reasons that will become clear later), such as:
$$ \boxed{H_A: \mu_1 > \mu_0} $$
We can also reformulate these hypotheses to be about the difference of the parameters, say $\tau = \mu_1 - \mu_0$, as follows:

$$ \boxed{H_0: \tau = 0 \qquad H_A: \tau \neq 0} $$

Note that null and alternative hypotheses are statements about population parameters, which we do not know (otherwise, we would not need all this machinery!). Our best bet is to use the data we have to provide evidence in favor of or against them. We define a test statistic as some function of the data that will allow us to determine whether or not we reject the null hypothesis in favor of the alternative. An intuitive choice for a test statistic is to estimate the parameter using some function of the data, called an estimator of that parameter, and draw conclusions about the hypotheses using that quantity. We usually symbolize an estimator of a parameter $\tau$ using the same symbol with a hat, $\hat{\tau}$.
We cannot, however, just use the estimator of a parameter to draw conclusions -- we also need to know how precise that estimation is. Let's continue with our running example, the difference between average age in competitive Iowa across treatment and control groups. As you might remember from your Statistics class, the sample mean is a good estimator for the population mean, and the difference in the sample means ($\hat{\tau}$) is a good estimator for the difference in population means ($\tau$). Formally:
$$ \boxed{\hat{\tau} = \hat{\mu}_1 - \hat{\mu}_0} $$
where $\hat{\mu}_1$ and $\hat{\mu}_0$ are the sample means from the treatment and control populations, respectively. In our example:
```{r hyptest1}
tau_hat_age = mean(filter(filtered_df, strata == "Iowa C", treat_real == 1)$age, na.rm = TRUE) -
  mean(filter(filtered_df, strata == "Iowa C", treat_real == 0)$age, na.rm = TRUE)

tau_hat_age
```
If we were to draw conclusions from just this estimation, we would conclude that the two population means are different. But what if the population means are the same, and the difference we observe is due to chance? After all, we are randomly sampling from the larger populations, so we cannot really rule out that possibility. In fact, it is incredibly unlikely that we would ever draw samples like these in which the difference in age was exactly $0$, even if the distribution of ages in the populations were identical. To account for this variation due to random sampling, or sampling variation, we estimate $\hat{\tau}$'s standard error as follows:
$$ \boxed{\hat{\sigma}_{\hat{\tau}} = \sqrt{\widehat{Var}[\hat{\tau}]} = \sqrt{\dfrac{\hat{\sigma}^2_1}{n_1} + \dfrac{\hat{\sigma}^2_0}{n_0}}} $$
where $\hat{\sigma}_1^2$ and $\hat{\sigma}_0^2$ are the sample variances for each group, and $n_1$ and $n_0$ are the number of observations in each group. Let us compute the estimated standard error for our example:
```{r hyptest2}
sigma_1_age = var(filter(filtered_df, strata == "Iowa C", treat_real == 1)$age, na.rm = TRUE)
sigma_0_age = var(filter(filtered_df, strata == "Iowa C", treat_real == 0)$age, na.rm = TRUE)
n_1_age = nrow(filter(filtered_df, strata == "Iowa C", treat_real == 1, !is.na(age)))
n_0_age = nrow(filter(filtered_df, strata == "Iowa C", treat_real == 0, !is.na(age)))

sigma_tau_hat_age = sqrt(sigma_1_age / n_1_age + sigma_0_age / n_0_age)
sigma_tau_hat_age
```
Now that we know that we need to consider the standard error in our hypothesis testing procedure and how to estimate it, we need to define a proper test statistic for our comparison of means. The usual choice for this type of test is the $z$-statistic, which is defined as the ratio of two quantities: the difference between the estimator and the parameter of interest, $\hat{\tau} - \tau$, and the standard error of the estimator, $\sigma_{\hat{\tau}}$. Formally:
$$ \boxed{Z = \dfrac{\hat{\tau}-\tau}{\sigma_{\hat{\tau}}}} $$ This statistic can be interpreted as follows. If the null hypothesis is true, $\tau = 0$. Since $\hat{\tau}$ is a good estimator of $\tau$, we expect it to be $0$ as well. But as we saw previously, it is unlikely that $\hat{\tau}$ is exactly $0$ even if $\tau=0$. So, given a value of $\hat{\tau}$ different than $0$, how confidently can we say that this is due to sampling variation versus due to $\tau$ not being $0$ at all? The answer will depend on the standard error, which means it depends on both the sample variances of the variable of interest in the two groups and the size of the two groups. In summary, if the null hypothesis is true, we would expect $Z$ to be small, and if the null hypothesis is false, we expect it to be large.
Note that the denominator of the $z$-statistic is $\sigma_{\hat{\tau}}$ and not $\hat{\sigma}_{\hat{\tau}}$, which means the $z$-statistic requires that you know the true variances (and hence the true standard error) rather than estimating them. Since we do not know the true variances, in practice we often use a $t$-statistic, which has the form:
$$ \boxed{t = \dfrac{\hat{\tau}-\tau}{\hat{\sigma}_{\hat{\tau}}}} $$
With a large enough sample size, the difference between the $z$-statistic and the $t$-statistic becomes negligible and the methodology underlying both is the same, so we can continue to focus on the conceptual framework for the $z$-statistic.
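To make this concrete, here is a quick sketch that plugs the quantities computed above (`tau_hat_age` and `sigma_tau_hat_age`) into the t-statistic under the null hypothesis $\tau = 0$.

```{r t_stat_age}
# t-statistic for H_0: tau = 0, using the difference in mean age and its
# estimated standard error computed above
t_stat_age = (tau_hat_age - 0) / sigma_tau_hat_age
t_stat_age
```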
Question 1: Consider the formula for the t-statistic and standard error carefully. What happens when the sample variance goes up? And when it goes down? Similarly, what happens when the sample size goes up? And when it goes down? Lastly, think about why the size of the numerator in the t-statistic is uninformative without the denominator (it could be useful to think about a case where the outcome of interest is a variable that can be measured in different units, for example, currency).
```{r Q1}
# YOUR ANSWER HERE
```
We have now accounted for the sampling variation of our estimator and defined a test statistic that should allow us to make a decision with respect to $H_0$. But... how do we make that decision? How do we know if the value of $Z$ is "large" enough to reject $H_0$? How can we use $Z$ to define a rule to reject $H_0$ or not?
We decide how large $Z$ needs to be based on our willingness to make a mistake called a Type I error. A Type I error is rejecting the null hypothesis when the null hypothesis is true. Continuing our example, a Type I error is concluding that there is an age difference between the treatment and control groups when, in truth, there is no difference. We use the formula below to determine the value $z^*$ that is sufficiently large given the probability $\alpha$ of committing a Type I error:
$$ \boxed{Pr(Z > z^*|H_0) = 1-\Phi(z^*) = \alpha} $$

where $\Phi(\cdot)$ is the cumulative distribution function of the Standard Normal distribution. This probability $\alpha$ receives a special name, the significance level, and is conventionally set to $0.05$, although $0.10$ and $0.01$ are also commonly used values. As experimenters, we choose a value for $\alpha$, obtain the corresponding value of $z$ we will use for the rejection rule, which is referred to as the critical value and symbolized by $z_{1-\alpha}$, and compare it to the value of $Z$. Formally, $z_{1-\alpha}$ is obtained from:
$$ \boxed{z_{1-\alpha} = \Phi^{-1}(1-\alpha)} $$

where $\Phi^{-1}$ is the inverse cumulative distribution function of the Standard Normal distribution. Visually, for a value of $\alpha$ of $0.05$:
```{r t1, echo = FALSE, fig.align='center', fig.asp=.35}
alpha = 0.05

x <- data.frame(x = seq(-3, 3, length = 1000))

norm_dist <- ggplot(data = x, mapping = aes(x = x)) +
  stat_function(fun = dnorm)

data <- data.frame(x = x$x, y = dnorm(x$x))

critical_value <- qt(1 - alpha, df = 1000)

norm_dist +
  theme_minimal() +
  geom_area(data = subset(data, x > critical_value),
            mapping = aes(x = x, y = y), fill = "blue", alpha = 0.6) +
  geom_vline(xintercept = c(critical_value), size = 1.5) +
  geom_hline(yintercept = 0) +
  annotate("text", x = 2.5, y = 0.07, label = "alpha == 0.05",
           size = 5, parse = TRUE, color = "blue") +
  annotate("text", x = 2.25, y = 0.3,
           label = paste("z[1-alpha] == ", round(critical_value, 3), sep = ""),
           size = 5, parse = TRUE) +
  xlab("z | H_0") +
  ylab("")
```
where the rejection region is to the right of the vertical line at $z_{1-\alpha}$. Two-sided alternatives can be approached similarly, but in this case the probability $\alpha$ is split between both tails of the distribution. Since the total size of the rejection region is the same for one- and two-sided alternatives, but it is divided between the two tails in the two-sided case, two-sided tests are more stringent. We reject $H_0$ if $|Z| > z_{1-\frac{\alpha}{2}}$, as shown below.
```{r t2, echo = FALSE, fig.align='center', fig.asp=.35}
set.seed(888)

alpha = 0.05

x <- data.frame(x = seq(-3, 3, length = 1000))

norm_dist <- ggplot(data = x, mapping = aes(x = x)) +
  stat_function(fun = dnorm)

data <- data.frame(x = x$x, y = dnorm(x$x))

critical_value <- qt(1 - alpha/2, df = 1000)

norm_dist +
  theme_minimal() +
  geom_area(data = subset(data, -x > critical_value),
            mapping = aes(x = x, y = y), fill = "blue", alpha = 0.6) +
  geom_area(data = subset(data, x > critical_value),
            mapping = aes(x = x, y = y), fill = "blue", alpha = 0.6) +
  geom_vline(xintercept = c(critical_value), size = 1.5) +
  geom_vline(xintercept = -c(critical_value), size = 1.5) +
  geom_hline(yintercept = 0) +
  annotate("text", x = 2.7, y = 0.08, label = "frac(alpha, 2) == 0.025",
           size = 5, parse = TRUE, color = "blue") +
  annotate("text", x = -2.5, y = 0.08, label = "frac(alpha, 2) == 0.025",
           size = 5, parse = TRUE, color = "blue") +
  annotate("text", x = 2.6, y = 0.3,
           label = paste("z[1-frac(alpha, 2)] == ", round(critical_value, 3), sep = ""),
           size = 5, parse = TRUE) +
  annotate("text", x = -2.7, y = 0.3,
           label = paste("-z[1-frac(alpha, 2)] == ", round(-critical_value, 3), sep = ""),
           size = 5, parse = TRUE) +
  xlab("z | H_0") +
  ylab("")
```
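As a quick check, we can compute these one- and two-sided critical values directly (here with `qnorm`; with 1,000 degrees of freedom, as in the figures above, the $t$ and standard normal quantiles are essentially identical) and compare them to the t-statistic we computed for the age difference.

```{r critical_values}
alpha = 0.05

# one-sided (right-tailed) and two-sided critical values
qnorm(1 - alpha)      # approximately 1.645
qnorm(1 - alpha / 2)  # approximately 1.960

# do we reject H_0: tau = 0 for the age difference at the 5% level (two-sided)?
abs(t_stat_age) > qnorm(1 - alpha / 2)
```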
If we reject a null hypothesis stating that a parameter is equal to $0$, we say that the quantity of interest is statistically significant.
Note that as we increase $\alpha$, our criterion for rejecting becomes less and less stringent. That is, we do not need as large a $Z$ (or $|Z|$, if two-sided) to reject. As a result, we will reject more often, and it will thus become increasingly likely that we reject a null hypothesis that is true (i.e., we would be more prone to a Type I error). However, we will also be less likely to fail to reject null hypotheses that are false. This second type of error is called a Type II error, is symbolized as $\beta$, and will play a key role in the design of our experiments, as we will see later in this tutorial.
Note: A test with an alternative hypothesis of the form $\tau > 0$ is called an upper or right-tailed test. The other type of one-sided test is the lower or left-tailed test, in which we test $H_0$ against the alternative $\tau < 0$. The results and procedures are exactly the same, except that we compare $Z$ with $z_{\alpha}$ instead of $z_{1-\alpha}$, and we would reject the null hypothesis if $Z < z_{\alpha}$. For the remainder of this tutorial, we will mean right-tailed tests when we refer to one-sided tests, but be mindful of this difference.
A $p$-value is the probability of observing a test statistic at least as extreme as the one we computed, under the null hypothesis. Packages like stats report $p$-values because they are more readily interpretable than the value of, for example, the $z$-statistic. A $p$-value is usually compared to the value of $\alpha$ to determine whether one should reject $H_0$ or not. This comparison is exactly equivalent to comparing $T$ and $t_{1-\alpha}$ (or $|T|$ and $t_{1-\frac{\alpha}{2}}$, if two-sided). More extreme values of $T$ (or $|T|$) imply lower $p$-values, and the $p$-value is less than the significance level $\alpha$ (meaning we can reject the null hypothesis) if and only if $T > t_{1-\alpha}$ (or $|T| > t_{1-\frac{\alpha}{2}}$, if two-sided). In practice, we typically report results using $p$-values and compare them to some significance level $\alpha$ rather than showing the test statistic. A $p$-value less than $\alpha$ means we can reject the null hypothesis.
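As an illustration, here is a minimal sketch of how the two-sided $p$-value for the age difference could be computed from the t-statistic above, using the normal approximation.

```{r p_value_age}
# two-sided p-value for the age difference, using the normal approximation
p_value_age = 2 * (1 - pnorm(abs(t_stat_age)))
p_value_age
```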
Confidence intervals are another way to quantify the uncertainty about our estimation. Confidence intervals for the difference in means estimator are constructed with the value of the estimator as the midpoint, and then defining a lower bound and an upper bound at equal distance. This distance is given by the estimated standard error of the estimator times $t_{1-\frac{\alpha}{2}}$, for the $(1-\alpha)$% level confidence interval.
$$ \left(\hat{\tau} \pm \hat{\sigma}_{\hat{\tau}} \times t_{1-\frac{\alpha}{2}}\right) $$
The confidence level $1-\alpha$ indicates that if we repeat the same procedure many times, the constructed interval will contain the true value of the parameter about a fraction $1-\alpha$ of the times. Note that it does NOT mean that the confidence interval will contain the parameter with $1-\alpha$ probability, because the true value of the parameter is fixed, and the interval either contains it or not.
Confidence intervals are used to understand the uncertainty about the estimation of a parameter, giving us an idea of how precise our estimation is. Narrower confidence intervals mean more precise estimations, and wider confidence intervals indicate the opposite. We can reject the null hypothesis that $\tau=0$ when the confidence interval does not contain zero.
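The sketch below builds this interval for the age difference from the quantities computed earlier, again using the normal approximation for the critical value.

```{r ci_age}
alpha = 0.05

# 95% confidence interval for the difference in mean age
ci_age = c(tau_hat_age - qnorm(1 - alpha / 2) * sigma_tau_hat_age,
           tau_hat_age + qnorm(1 - alpha / 2) * sigma_tau_hat_age)
ci_age
```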
Question 2: Holding everything else constant, what does increasing the sample size do to the confidence interval? Why?
```{r Q2}
# YOUR ANSWER HERE
```
Randomized experiments are considered the gold standard for evaluating interventions because the randomization process ensures that the groups we are comparing are similar in every characteristic that might affect the outcomes of interest, except for the intervention they receive. This important feature makes the analysis more direct, usually model-free, and also more easily interpretable. Essentially, if we see differences between groups, we can confidently attribute those to the interventions. Importantly, for this to be true, the randomization must be carried out carefully. There are many ways that a random assignment can go wrong, and when this happens, the analysis and interpretations can become complicated. In an extreme case, it might render the experiment useless. Unfortunately, it is not always in the experimenter's power to ensure that randomization goes smoothly. Therefore it is important to conduct some checks to evaluate this aspect of the experiment.
Consider our dataset as an example. Remember that households were randomly assigned to treatment and control groups (as indicated by the variable `treat_pseudo`), but that we are using `treat_real`, which is based on whether the households had their phone number listed, as our treatment variable. Suppose that `treat_pseudo` was correctly implemented, so that households in both groups are similar. We could still have a problem because `treat_real` was not randomly assigned. Ideally, having a listed phone number or not would be independent of whether the household voted (the outcome of interest), and the proportion of households with a listed phone number would be the same in the treatment and control groups. We saw above that this latter statement is not true. Is this enough of a problem to resort to more complicated analysis methods, or can we safely treat our data as experimental data?
This situation is where covariates play an important role. Remember that covariates are "pre-treatment" variables and, as such, cannot be affected by the treatment. If the randomization was carried out correctly, we should not observe differences in covariates across the groups. For example, the proportion of females or the proportion of voters that are newly registered should be similar across groups. If not, we will not be able to claim that differences in the outcomes of interest are due to the interventions: they might very well be due to differences in these covariates. When we cannot detect differences in pre-treatment variables across groups, we say that the groups are balanced, and the checks we use to determine whether the groups are balanced are called balance checks.
Note that even if we are certain that the randomization process went well, we will always see numerical differences in the covariate values due to the fact that we are sampling a finite number of observations from some population of interest. For example, if we compare the average age across treatment and control in the competitive districts of Iowa:
```{r pre_balance}
filter(filtered_df, strata == "Iowa C") %>%
  group_by(treat_real) %>%
  summarize(mean_age = mean(age, na.rm = TRUE), .groups = 'drop') %>%
  kable(format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = "striped")
```
How can we determine if this difference of $0.11$ in the sample means is large or not? The standard way to do it in experimental analysis is through hypothesis testing, a topic which we cover next.
After learning about statistical tests, we are equipped to address our concerns about whether our experiment is balanced. A balance table is a table that shows the sample means of the different covariates for the two treatment groups, their standard errors, and the results of a $t$-test comparing them. (Remember that a $t$-test is equivalent to a $z$-test in large enough samples. In small samples, however, it is more appropriate to use a $t$-test, so our default choice when implementing tests is the $t$-test.) Ideally, if randomization was carried out correctly, we would fail to reject all of the null hypotheses corresponding to the differences in means. The code below produces a balance table for some of the covariates from the households of the Iowa competitive stratum.
```{r balance_table}
balance_strata <- map_dfr(
  .x = "Iowa C", # Select the stratum
  .f = function(y)
    # list each covariate we want to compute a t-test for
    map(c("age", "persons", "newreg", "vote00", "vote98", "female"),
        # specify function
        .f = function(x)
          cbind(
            # we want a data frame output
            tidy(
              # run t-test for each covariate x and stratum y
              t.test(as.formula(paste(x, "~ treat_real")),
                     data = subset(filtered_df, strata == y))
            ),
            # make columns for covariate and stratum so we can see which t-tests
            # relate to which outputs
            covariate = x,
            strata = y
          )
    )) %>%
  # select our variables of interest
  select(covariate, estimate1, estimate2, p.value) %>%
  # rename estimates so they are more clear
  rename(mean_control = estimate1, mean_treat = estimate2) %>%
  # arrange by covariate and stratum
  arrange(covariate) %>%
  # reorder columns
  select(covariate, mean_control, mean_treat, p.value)

# generate table
kable(balance_strata,
      # format numbers
      format.args = list(big.mark = ",", scientific = F, digits = 3)) %>%
  # format table
  kable_styling(bootstrap_options = "striped")
```
Optional: test your comprehension. Try constructing the balance tables for the other strata. What can you say with respect to the balance of the two groups within each stratum? Remember the identifiers we used to label each stratum: `Iowa NC`, `Iowa C`, `Michigan NC`, and `Michigan C`.
```{r Opt_2}
```
When looking at balance, even if the randomization was carried out correctly, you may encounter some statistically significant differences between treatment groups, like `newreg` in the table above. This can happen especially if the sample size is large enough to pick up small differences. Another way to assess balance is to look at the size of the differences and ask whether they are meaningful. To evaluate this question, we need some kind of standard of comparison. One way to make this comparison is with standardized mean differences:
$$ \boxed{\widehat{SMD} = \dfrac{\hat{\mu}_1 - \hat{\mu}_0}{\sqrt{\hat{\sigma}_1^2+\hat{\sigma}_0^2}}} $$

where $\hat{\mu}_1$ and $\hat{\mu}_0$, and $\hat{\sigma}^2_1$ and $\hat{\sigma}^2_0$, are the sample means and variances of the two groups, respectively. This estimator expresses the difference in means relative to the variability of the two groups, and thus normalizes the scale. Below you can see a plot with the standardized mean differences of the covariates in our data, for the competitive Iowa stratum.
```{r balance_plot}
# Compute SMD by stratum
love.plot(treat_real ~ age + persons + newreg + vote00 + vote98 + female,
          data = filter(filtered_df, strata == "Iowa C"),
          binary = "std", # display standardized mean differences
          thresholds = c(m = .1),
          colors = c(colorblind_pal()(8)[2])) +
  scale_x_continuous(breaks = seq(-0.1, 0.1, .1)) +
  # specify theme aesthetics
  theme_minimal() +
  # specify text size and face
  theme(strip.text = element_text(size = 12),
        axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 1),
        axis.text.y = element_text(size = 12),
        axis.title = element_text(size = 12, face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold", hjust = 0.5)) +
  # specify labels
  labs(title = "Covariate Balance",
       x = 'Standardized Mean Difference',
       y = "Covariate",
       caption = "For observations with a listed phone number, treatment - control.")
```
Since $\widehat{SMD}$ does not follow a well-defined probability distribution, there is no machinery equivalent to that of the regular difference in means for using it as a test statistic. However, a well-adopted rule of thumb is that estimated standardized mean differences of less than $0.25$ (in absolute value) suggest balance. The estimated $SMD$s for all the covariates meet this criterion, including that of `newreg`.
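To connect the plot to the formula above, here is a minimal sketch that computes the standardized mean difference for `age` in the competitive Iowa stratum by hand, following the formula in the text; the result should be close to what `love.plot` displays, although cobalt's exact denominator conventions may differ slightly.

```{r smd_age_by_hand}
iowa_c = filter(filtered_df, strata == "Iowa C")

# sample means and variances of age in the treatment and control groups
mu_1 = mean(filter(iowa_c, treat_real == 1)$age, na.rm = TRUE)
mu_0 = mean(filter(iowa_c, treat_real == 0)$age, na.rm = TRUE)
var_1 = var(filter(iowa_c, treat_real == 1)$age, na.rm = TRUE)
var_0 = var(filter(iowa_c, treat_real == 0)$age, na.rm = TRUE)

# standardized mean difference, following the formula in the text
smd_age = (mu_1 - mu_0) / sqrt(var_1 + var_0)
smd_age
```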
Optional: test your comprehension. Try computing the standardized mean differences for all the covariates in each stratum. Does this provide evidence in favor of the randomness of the assignment? Remember the identifiers we used to label each stratum: `Iowa NC`, `Iowa C`, `Michigan NC`, and `Michigan C`.
```{r Opt_3}
```
To close this subsection, it is important to note that we can only check the variables that we observe. Randomization can be faulty in ways we cannot measure, and we would not have a way to identify them. If the assignment is truly random, however, those unobservable variables will be similar across groups by design.
In this section, we will focus on the estimation and inference about the most common parameter of interest in experimental work, the average treatment effect (ATE) of the intervention on a given outcome or outcomes.
The average treatment effect between two treatment arms $A$ and $B$ on a given outcome of interest is defined as the difference between the mean of the outcome for the observations assigned to arm $A$ and the mean of the outcome for the observations assigned to arm $B$. Note that $A$ and $B$ can be treatment (an intervention) and control, as in our example, but they can also be two different interventions. If the treatment and control groups are similar, our well-known sample difference $\hat{\tau}$ is an unbiased estimator of this quantity of interest. We have seen how to compute it, how to calculate its standard error, and how to test the null hypothesis that the quantity it estimates is equal to $0$ in a previous subsection, and we can apply the same machinery here. Below, we use the `difference_in_means` function from the estimatr package to estimate the average treatment effect of calling registered voters in competitive Iowa on whether they actually turn out to vote. This function is useful because it allows us to easily subset the data, and it conveniently provides inference-related quantities such as the standard error and confidence intervals.
```{r difference_in_means}
# Use the estimatr package for difference in means estimates
difference_in_means(vote02 ~ treat_real,
                    data = filter(filtered_df, strata == "Iowa C"),
                    alpha = 0.05) %>%
  # make output a data frame
  tidy() %>%
  # hide the degree-of-freedom column
  select(-df) %>%
  # generate table and format numbers
  kable(digits = 3,
        format.args = list(big.mark = ",", nsmall = 3, scientific = F)) %>%
  # format table
  kableExtra::kable_styling(bootstrap_options = "striped")
```
Note that we have three different but equivalent ways of reaching the same conclusion: we fail to reject the null hypothesis (with $\alpha = 0.05$).

- The absolute value of the test statistic (which we had called $T$) is smaller than the critical value $t_{0.975} \approx 2$.
- The confidence interval includes $0$.
- The $p$-value is larger than $\alpha$.

All three rejection rules lead to the same conclusion.
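As a sketch, the three rules can be checked directly from the tidied `difference_in_means` output, assuming the usual `tidy()` column names (`statistic`, `p.value`, `conf.low`, `conf.high`).

```{r three_rules}
dim_iowa_c = tidy(difference_in_means(vote02 ~ treat_real,
                                      data = filter(filtered_df, strata == "Iowa C"),
                                      alpha = 0.05))

# 1. |test statistic| vs. the critical value (about 2)
abs(dim_iowa_c$statistic) < qnorm(0.975)
# 2. does the confidence interval include 0?
dim_iowa_c$conf.low < 0 & dim_iowa_c$conf.high > 0
# 3. is the p-value larger than alpha?
dim_iowa_c$p.value > 0.05
```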
Optional: test your comprehension. Try obtaining the estimates for the treatment effect in the other strata. Do you see any differences with respect to competitive Iowa? Remember the identifiers we used to label each stratum: `Iowa NC`, `Iowa C`, `Michigan NC`, and `Michigan C`.
```{r Opt_4}
```
Suppose that we care instead about the treatment effect of calling on voter turnout in the sample as a whole, as opposed to in each stratum separately. How do we estimate this quantity? In our example, we must take into account the important fact that the assignment to treatment of the experimental units was performed by strata, also called "blocks". "Stratifying" or "blocking" is a design technique for getting more precise estimates of the average treatment effect by using prior knowledge about the outcome variable and how it varies across strata. For example, if the outcome variable takes very low values in one stratum and very high values in another, we will do better by estimating treatment effects separately and then aggregating them, instead of estimating the average treatment effect using all the observations jointly. A stratified assignment might sometimes be costly to implement. Therefore, a good experimental design should compare the gains in precision from stratification against the cost it would represent, and decide whether it is worth it. You can read more about why we might account for stratified, or blocked, random assignment on the DeclareDesign website.
The `difference_in_means` function we just used has a `blocks` option to indicate the presence of strata.
```{r blocked_difference_in_means}
# we use parameter 'blocks' to indicate strata
difference_in_means(vote02 ~ treat_real, data = filtered_df, blocks = strata) %>%
  # format output to table
  tidy() %>%
  # hide the degree-of-freedom column
  select(-df) %>%
  # generate table and format numbers
  kable(digits = 3,
        format.args = list(big.mark = ",", nsmall = 3, scientific = F)) %>%
  # format table
  kable_styling(bootstrap_options = "striped")
```
We fail to reject the null hypothesis for the average treatment effect across all the strata.
Optional: test your comprehension. The estimator we use for the average treatment effect when accounting for the strata is a weighted average of the individual difference in means estimators. Think about what those weights could be, compute the stratified estimator as a weighted average, and check that it matches our previous computation.
```{r Opt_5}
```
Optional: test your comprehension. What would have happened if we estimated the average treatment effect of the calls for all the households jointly without accounting for the stratification? Would our conclusions change? Would this be a valid way of estimating the average treatment effect? Why?
```{r Opt_6}
# YOUR ANSWER HERE
```
In this section, we discuss how to make one critical decision in experimental design: the target sample size. To see why this is crucial, remember the formula for the standard error of the difference in means estimator:
$$ \boxed{\sigma_{\hat{\tau}} = \sqrt{{Var}[\hat{\tau}]} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma^2_0}{n_0}}} $$
For the purpose of illustration, let us simplify it a bit. Assume that $\sigma_1^2 = \sigma_0^2 = \sigma^2$ and $n_1 = n_0 = n$. Then, we have:
$$ \boxed{\sigma_{\hat{\tau}} = \sqrt{{Var}[\hat{\tau}]} = 2\sigma/\sqrt{n}} $$ As we saw in the previous section, the standard error determines the precision of our estimation. Larger standard errors lead to lower precision, while smaller standard errors lead to higher precision. From the formula above, we see that there are two elements that determine the size of the standard error: $\sigma^2$ and $n$. $\sigma^2$ is the variance of the outcome of interest, and it is outside the experimenter's control. $n$, the sample size, however, is under the experimenter's control, something we can choose. Increasing the sample size reduces the standard error and increases the precision of the estimation, which is good. Unfortunately, it does not come for free, as getting a larger sample usually increases the cost of the experiment. Therefore, it becomes key to the experimental design to get an idea of how large $n$ has to be for the experiment to make sense. We will go over two intimately related concepts that will help us make this decision: the power of an experiment, and the minimum detectable effect.
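As a quick numerical illustration of this trade-off, the sketch below evaluates the simplified expression above for a few sample sizes (with an assumed $\sigma = 20$, chosen only for illustration): quadrupling $n$ halves the standard error.

```{r se_vs_n}
sigma = 20
n = c(25, 100, 400, 1600)

# standard error of the difference in means under the simplification above
se = 2 * sigma / sqrt(n)
data.frame(n = n, se = se)
```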
Let us revisit two concepts from the Hypothesis Testing section: the Type I error and the Type II error. The Type I error is the mistake we make by rejecting a null hypothesis when it is actually true. We symbolized the probability of making this mistake as $\alpha$, and named it the significance level. The Type II error is the mistake we make by failing to reject a null hypothesis when it is false and the proposed alternative is true. We symbolized the probability of making this mistake as $\beta$ (and is usually referred to as risk).
The power of an experiment is the probability of rejecting a null hypothesis when it is false, and the proposed alternative is true. It is, naturally, $1 - \beta$. All three concepts are closely related. In fact, holding everything else constant, fixing the value of one of them fixes the values of the other two. While this is clear when it comes to $\beta$ and $1-\beta$, we will do some work to understand the relationship between these two and $\alpha$.
Suppose that we have the following hypotheses about the difference in means:
$$ \boxed{H_0: \tau = 0 \qquad H_A: \tau > 0} $$
Our decision rule consists of rejecting $H_0$ if our $Z$ statistic is larger than some critical value $z_{1-\alpha}$. What, then, is the power of our experiment? Let us start by stating formally what the power is for this case of a one-sided alternative hypothesis. Remember the definition of power: the probability of rejecting the null hypothesis when it is false, in favor of the alternative. Formally:
$$ \boxed{Pr(Z \geq z_{1-\alpha} \mid H_A ) = 1-\beta} $$
After doing some math, we can express this as:
$$ \boxed{1-\beta = \Phi\left( \frac{\tau}{2\sigma/\sqrt{n}} - z_{1-\alpha} \right)} $$
Proof. To derive the formula for power, plug the components of $Z$ into $\Pr\left( Z \ge z_{1-\alpha} \mid H_A \right)$:

$$ \Pr\left( \frac{\hat{\tau}}{2\sigma/\sqrt{n}} \ge z_{1-\alpha} \;\Big|\; H_A \right) $$

Under $H_A$, $\frac{\hat{\tau}-\tau}{2\sigma/\sqrt{n}} \sim N(0, 1)$. With a bit of algebra on both sides of the inequality, this gives us

$$ 1 - \Pr\left( \frac{\hat{\tau}-\tau}{2\sigma/\sqrt{n}} \le z_{1-\alpha} + \frac{-\tau}{2\sigma/\sqrt{n}} \right) = 1 - \Phi\left( z_{1-\alpha} + \frac{-\tau}{2\sigma/\sqrt{n}} \right) = \Phi\left( \frac{\tau}{2\sigma/\sqrt{n}} - z_{1-\alpha} \right) $$

where in the last step we used the fact that the standard normal distribution is symmetric.
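To see the formula in action, the sketch below implements it directly with `pnorm` and `qnorm`; the particular values of $\tau$, $\sigma$, $n$, and $\alpha$ are arbitrary choices for illustration.

```{r power_formula}
# power of a one-sided test, following the formula above
power_formula = function(tau, sigma, n, alpha) {
  pnorm(tau / (2 * sigma / sqrt(n)) - qnorm(1 - alpha))
}

# illustrative values
power_formula(tau = 5, sigma = 20, n = 100, alpha = 0.05)
```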
The definition of power contains three major elements: $\tau$, $z_{1-\alpha}$, and $\sigma/\sqrt{n}$. Let us examine what this formula is telling us, element by element, with respect to the power of our experiment. Note that, because it is a cumulative distribution function, higher values inside of $\Phi(\cdot)$ imply higher values for $\Phi(\cdot)$ (it is increasing in its argument)!
First, power is higher when $\tau$ is larger. Intuitively, this relationship makes sense because the larger the difference, or the bigger the treatment effect, the easier it is to detect. The figure below shows the value of the power for different values of $\tau$ holding constant $\alpha$, $\sigma$, and $n$.
```{r pwer1, echo = FALSE, fig.asp=.35, fig.align='center'}
alpha = 0.05
sigma = 20
n = 100

power_calc = data.frame(tau = seq(1, 10, length = 1000))
power_calc$powr = power.t.test(n = n, sd = sigma, delta = power_calc$tau,
                               sig.level = alpha, type = "one.sample",
                               alternative = "two.sided")$power

ggplot(power_calc, aes(x = tau, y = powr)) +
  geom_line() +
  theme_minimal() +
  # adjust size and face of text
  theme(strip.text = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12, face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold", hjust = 0.5)) +
  # specify labels
  labs(title = "", x = 'True difference in means', y = "Power")
```
Second, power is higher when $z_{1-\alpha}$ is lower, implying that $\alpha$ is higher. This relationship is also fairly intuitive. As it becomes easier to reject the null hypothesis, we need fewer observations in order to do so. On the other hand, we are also increasing the chance of rejecting a true $H_0$, which is exactly the definition of $\alpha$! Put differently, there is a trade-off between the probabilities of a Type I error and a Type II error, $\alpha$ and $\beta$. The figure below shows the value of the power for different values of $\alpha$, holding constant $\tau$, $\sigma$, and $n$.
```{r pwer2, echo = FALSE, fig.asp=.35, fig.align='center'}
tau = 5
sigma = 20
n = 100

power_calc = data.frame(alpha = seq(0.01, 0.30, length = 1000))
power_calc$powr = power.t.test(n = n, sd = sigma, delta = tau,
                               sig.level = power_calc$alpha, type = "one.sample",
                               alternative = "two.sided")$power

ggplot(power_calc, aes(x = alpha, y = powr)) +
  geom_line() +
  theme_minimal() +
  # adjust size and face of text
  theme(strip.text = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12, face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold", hjust = 0.5)) +
  # specify labels
  labs(title = "", x = 'Significance level', y = "Power")
```
Finally, power is higher when $2 \sigma/\sqrt{n}$, the standard error of the estimate, is lower. If we look at the two components of the standard error separately, that means power is higher when the standard deviation (recall standard deviation is the square root of the variance $\sigma = \sqrt{\sigma^2}$) is lower. The intuition here is that, for a given treatment effect $\tau$, greater variance implies there is higher probability that $\tau$ is large just by chance. It would be really difficult to get $\hat{\tau}=10$ if the standard deviation is $0.0001$. The figure below shows the value of the power for different values of $\sigma$, holding constant $\tau$, $n$, and $\alpha$.
```{r pwer3, echo = FALSE, fig.asp=.35, fig.align='center'}
tau = 5
alpha = 0.05
n = 100

power_calc = data.frame(sigma = seq(1, 20, length = 1000))
power_calc$powr = power.t.test(n = n, sd = power_calc$sigma, delta = tau,
                               sig.level = alpha, type = "one.sample",
                               alternative = "two.sided")$power

ggplot(power_calc, aes(x = sigma, y = powr)) +
  geom_line() +
  theme_minimal() +
  # adjust size and face of text
  theme(strip.text = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12, face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold", hjust = 0.5)) +
  # specify labels
  labs(title = "", x = 'Standard deviation', y = "Power")
```
Lastly, power is higher when the denominator of the standard error $n$ is higher. Recall this is the one parameter which is under the experimenter's control, the one you choose. The larger your sample, the more power you have. The figure below shows the value of power for $n$, holding constant $\tau$, $\sigma$ and $\alpha$.
```{r pwer4, echo = FALSE, fig.asp=.35, fig.align='center'}
tau = 5
sigma = 20
alpha = 0.05

power_calc = data.frame(n = seq(1, 1000, length = 1000))
power_calc$powr = power.t.test(n = power_calc$n, sd = sigma, delta = tau,
                               sig.level = alpha, type = "one.sample",
                               alternative = "two.sided")$power

ggplot(power_calc, aes(x = n, y = powr)) +
  geom_line() +
  theme_minimal() +
  # adjust size and face of text
  theme(strip.text = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12, face = "bold"),
        legend.position = "none",
        plot.title = element_text(size = 16, face = "bold", hjust = 0.5)) +
  # specify labels
  labs(title = "", x = 'Sample size', y = "Power")
```
Let us summarize the takeaways:

- Power is higher when the true effect $\tau$ is larger.
- Power is higher when the significance level $\alpha$ is higher (i.e., when $z_{1-\alpha}$ is lower), at the cost of a higher probability of a Type I error.
- Power is higher when the standard deviation $\sigma$ of the outcome is lower.
- Power is higher when the sample size $n$ is larger, which is the one element under the experimenter's control.
After learning about power, we can ask ourselves how we can use this knowledge to inform our experimental design. It turns out that there is more than one way to look at the power dimension, and the one that is best will depend on the specifics of each particular case, and on the prior knowledge of the setting. Power analysis can be used to answer three relevant questions, which we will look into in detail.
The first question we can ask is: "Given a sample size $n$ and a standard deviation of the data of $\sigma$, what is the power of my experiment to detect an effect of $\tau$ with confidence level $(1-\alpha)$ in a one-sided (or two-sided) test?"
Note: In experimental jargon, detecting an effect is another way of saying that we reject the hypothesis of the effect of a treatment being $0$.
Notice that this question is basically taking the power equation:
$$ \boxed{1-\beta = \Phi\left( \frac{\tau}{2\sigma/\sqrt{n}} - z_{1-\alpha} \right)} $$

and plugging in values for $\tau$, $\sigma$, $n$, and $\alpha$, so that the only unknown is $\beta$. We are typically interested in this question when our sample size $n$ is already set (for example, if we have a fixed budget to collect data), and we have an idea of what would be an interesting effect $\tau$ to detect. Finally, we can always control the choice of $z_{1-\alpha}$, but we do need some information about $\sigma$. It is often the case that we form an idea of what $\sigma$ is by looking at pilot data.
By convention, a value of power greater than $0.8$ is considered acceptable. The package stats contains the function `power.t.test`, which we can use to compute the power of a two-sample test given all the other values. Let us see an example of how it works. We ask the following question: given a sample size of $100$ and a standard deviation of the data of $10$, what is the power of an experiment to detect an effect of $3$ with a confidence level of $95$% in a one-sided test?
```{r powercalc1}
power_example_1 = power.t.test(n = 100,
                               delta = 3,
                               sd = 10,
                               power = NULL,
                               sig.level = 0.05,
                               type = "two.sample",
                               alternative = "one.sided")

power_example_1$power
```
We see that the power of our experiment is about $0.68$. This means that if we repeated this experiment many times with those parameter values being true, we would reject the null hypothesis about 68% of the time.
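If you want to convince yourself of this interpretation, here is a small simulation sketch (the number of replications and the seed are arbitrary choices): it repeatedly draws two samples of $100$ with a true difference of $3$ and standard deviation $10$, and records how often a one-sided t-test rejects at the 5% level. The rejection rate should land near $0.68$.

```{r power_simulation}
set.seed(123)
n_reps = 2000

rejections = replicate(n_reps, {
  control = rnorm(100, mean = 0, sd = 10)
  treatment = rnorm(100, mean = 3, sd = 10)
  # one-sided two-sample t-test, as in the power calculation above
  t.test(treatment, control, alternative = "greater")$p.value < 0.05
})

mean(rejections)  # should be close to the power computed above
```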
Optional: test your comprehension. How does modifying our power tests for a two-sided hypothesis change our power?
```{r Opt_7}
# YOUR ANSWER HERE
```
The second question we can ask is: "Given a sample size $n$ and a standard deviation of the data of $\sigma$, what is the minimum effect size $\tau$ we can detect with a power of $(1-\beta)$ and confidence level $(1-\alpha)$ in a one-sided (or two-sided) test?"
Now, we are taking the power equation:
$$ \boxed{1-\beta = \Phi\left( \frac{\tau}{2\sigma/\sqrt{n}} - z_{1-\alpha} \right)} $$
and plugging in values for $\beta$, $\sigma$, $n$, and $\alpha$, so that the only unknown is $\tau$. The value we obtain for $\tau$ is known as the minimum detectable effect (MDE) for that set of parameter values: it is essentially the smallest effect for which, if we ran the experiment repeatedly, we would reject the null hypothesis at confidence level $(1-\alpha)$ in $(1-\beta)$% of the repetitions. It does NOT mean that we will not be able to detect an effect smaller than the MDE. Rather, we will just have a lower probability of detecting such an effect. On the flip side, we have a higher than $(1-\beta)$% chance of detecting an effect size that is larger than $\tau$.
Question 3: Use the function `power.t.test` to answer the following question. Given a sample size of $500$ and a standard deviation of the data of $20$, what is the MDE of an experiment with a confidence level of $95$% and $80$% power in a one-sided test? Remember that the significance level is $1 -$ confidence level.
```{r Q3}
```
Optional: test your comprehension. What happens to the minimum detectable effect when we change the other parameters? Why?
```{r Opt_8}
# YOUR ANSWER HERE
```
The third and last question we can ask to inform our design is the following: "Given a standard deviation of the data of $\sigma$, what is the needed sample size $n$ to detect an effect of $\tau$ with a power of $(1-\beta)$ and confidence level $(1-\alpha)$ in a one-sided (or two-sided) test?"
As you probably guessed, we are taking the power equation:
$$ \boxed{1-\beta = \Phi\left( \frac{\tau}{2\sigma/\sqrt{n}} - z_{1-\alpha} \right)} $$
and plugging in values for $\beta$, $\sigma$, $\tau$, and $\alpha$, so that the only unknown is $n$. When we have an idea of what an interesting effect size would be and are planning the data collection process, having an answer to this question can determine whether or not an experiment is feasible. Budget constraints are always a concern, but it can be just as true that the size of a population limits the feasibility of an experiment. If we want to learn about the effect of a change in policy at Stanford GSB on Stanford GSB students, then our maximum sample size is the population size of Stanford GSB students.
Optional: test your comprehension. Use the function `power.t.test` to answer the following question. Given a standard deviation of the data of $20$, what is the required sample size for an experiment to detect an effect of $10$ with a confidence level of $95$% and $80$% power in a one-sided test? Remember that the significance level is $1 -$ confidence level.
```{r Opt_9}
```