This exercise is designed to give you a practical application of randomization and power calculations in Stata and R. In this exercise, you’ll replicate the random assignment process for a randomized evaluation of two education programs in India: Continuous and Comprehensive Evaluation (CCE) and Teaching at the Right Level (TaRL). You’ll implement clustered and stratified random assignment, assess baseline balance, and calculate statistical power. The goal of the exercise is to familiarize yourself with how to conduct randomization and power calculations for complex experimental designs.
📚 Berry, James, Priya Mukherji, Shobhini Mukherji, and Marc Shotland. (2018). Failure of Frequent Assessment: An Evaluation of India’s Continuous and Comprehensive Evaluation Program. J-PAL South Asia.
To get started, make a copy of this Dropbox folder, which contains the original data from the study as well as the exercise coding files, on your local machine. If you are working in Stata, open the do-file RST_rand_power.do. If you are working in R, open the R script RST_rand_power.R.
Instructions: Install the necessary packages you’ll be working with and set your working directory at the top of the file.
The following commands can be used at the beginning of your file to set up your environment. `clear all` removes all data, saved results, and stored programs from memory, `set more off` prevents Stata from pausing output, and `version 16` ensures your code runs under Stata 16’s syntax and behavior (note that in this case, your version of Stata must be version 16 or greater).
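A minimal version of that setup chunk, assuming you are running Stata 16 or later, might look like this:

```stata
* Set up the environment
clear all        // remove data, saved results, and stored programs from memory
set more off     // do not pause output
version 16       // run under Stata 16 syntax and behavior
```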
Below, we’ll install two packages that we’ll leverage later in the exercise. `distinct` is used to count the number of unique values in a variable, and `randtreat` is used to generate randomized treatment assignment. You can find additional documentation for the `randtreat` command here.
* Install required packages
ssc install distinct
net install randtreat, from("https://raw.github.com/acarril/randtreat/master/") replace
Finally, set your working directories to the folders you’ll be using with Stata’s `global` command. Add the location of the main folder you downloaded between the quotation marks after `global main`.
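As a sketch, the directory globals could be defined as follows. The path is a placeholder, and the subfolder names `data` and `temp` are assumptions: later code refers to `$data` and `$temp`, so adjust them to match the folder you downloaded.

```stata
* Set working directories (replace the placeholder with your own path)
global main "path/to/your/main/folder"
global data "$main/data"   // assumed subfolder holding primary_cleaned.dta
global temp "$main/temp"   // assumed subfolder for intermediate files
```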
The study was conducted across 500 lower and upper primary government schools in two districts of Haryana, India, during the 2012–13 academic year. The intervention focused on improving foundational learning by testing two programs—CCE, a policy emphasizing regular assessments and student tracking, and TaRL, an instructional approach that tailors teaching to children’s learning levels. The randomized evaluation of the program was based on a factorial design, where schools were randomly assigned in equal proportions to one of four groups.
Treatment Group | Description |
---|---|
T1 | Comparison group: Schools within a school-campus assigned to this group follow the standard government curriculum. |
T2 | CCE only: Schools within a school-campus assigned to this group receive CCE only. |
T3 | TARL only: Schools within a school-campus assigned to this group receive TARL only. |
T4 | CCE + TARL: Schools within a school-campus assigned to this group receive both TARL and CCE. |
The authors of the study collected data on student learning outcomes at baseline and endline, as well as teacher and school characteristics. While the real study conducted two experiments in lower and upper primary schools, this exercise will only use the lower primary school data, labeled as primary_cleaned.dta. The RCT evaluated the impact of the programs on four learning outcomes: oral Hindi, oral math, written Hindi, and written math.
The codebook below lists some of the main variables we’ll be using in the exercise.
Data | Variable in primary_cleaned.dta |
---|---|
School campus | super_school_id |
Original stratum | stratum |
Female | female |
Age (years) | age_years |
Grade in 2011–12 school year | base_standard |
Oral Hindi score at baseline (standardized) | base_aser_read_norm |
Oral math score at baseline (standardized) | base_aser_math_norm |
Written Hindi score at baseline (standardized) | base_h_score_norm |
Written math score at baseline (standardized) | base_m_score_norm |
In this section of the exercise, you will replicate the randomization of the original study. Note that even though you are conducting the same randomization process, you will not get the exact same results as the paper since randomization is–you guessed it–random!
The goal is to randomly assign schools to one of the four treatment groups, clustering at the school campus level and stratifying by the original `stratum` variable created by the authors. This can be done manually, or by using packages like `randomizr` in R or `randtreat` in Stata.
The `stratum` variable was constructed by first grouping schools by block and school type (lower, upper, or both). Within each group, school campuses were sorted by average baseline test scores and grouped into strata of four school campuses each to enable stratified random assignment across four treatment arms.
Misfits happen when, during the randomization process, the number of units in a group (like a stratum) doesn’t divide evenly across the treatment groups. For example, if you have 5 units and want to assign them evenly to two groups (treatment and comparison), one unit will be left over—this is a misfit.
In simple cases, such as random assignment to two groups without stratification, you can just randomly assign that leftover unit after dividing up the rest. But when working with multiple strata, you need to decide whether to balance treatment assignments within each stratum (local allocation of misfits) or across the whole sample (global allocation of misfits).
Check out this paper by Carril (2017) for more details on dealing with misfits.
Instructions: Randomize assignment of school campuses to four groups of equal proportion, stratifying on the `stratum` variable and handling misfits locally. Set your seed to `12345`.
The code below imports the lower primary school data from the study, referencing the `global` for the data folder you set above with a `$`.
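For instance, assuming the data folder global is named `data` (the name used by `$data` later in this exercise), the import could look like this:

```stata
* Import the lower primary school data
use "$data/primary_cleaned.dta", clear
```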
The chunk below uses Stata’s `preserve` and `restore` commands to conduct the random assignment without altering the original dataset. The code chunk should be run all together, rather than line by line. Since the data is at the student level and we’ll be conducting randomization at the school campus level, we keep only unique school campuses.
We then randomize treatment using the `randtreat` command we installed earlier:
- `generate`: tells Stata to generate the variable `treatment`, which contains the treatment assignment for each school campus
- `strata`: asks Stata to stratify by the `stratum` variable
- `misfits`: allows you to decide how to allocate misfits, in this case, locally
- `multiple`: option for number of equal-sized groups
- `setseed`: allows you to specify a seed for the randomization. This is important so that anybody who runs your code can exactly replicate your randomization!
We’ll then save the results of the randomization in the temp folder we specified earlier.
* Random assignment
preserve
keep super_school_id stratum
duplicates drop
sort super_school_id
randtreat, generate(treatment) strata(stratum) ///
misfits(strata) multiple(4) setseed(12345)
save "$temp/school_campus_treatment.dta", replace
restore
Below, we’ll merge the random assignment of school campuses back into our original dataset using `super_school_id`, and label the comparison and treatment groups accordingly. Check out your `treatment` variable using the `tab` command.
* Merge original data with the generated random assignment
merge m:1 super_school_id using "$temp/school_campus_treatment.dta"
drop _merge
* Label the treatment groups
label define trt 0 "Control" 1 "CCE" 2 "TARL" 3 "CCE+TARL"
label values treatment trt
tab treatment, m
We’ll then generate two binary dummies for whether students are assigned to a group with any CCE program (i.e., treatment groups 1 and 3) or any TARL program (i.e., treatment groups 2 and 3), as well as binary dummies for each treatment group. These will be used in the next section for our summary table.
Since you will need to do this a few times throughout the exercise, we’ll wrap our code in a very simple program called `treatment_dummies`. You can call the program later on by running `treatment_dummies`.
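* Create program for generating binary dummies for treatment status
* (drop any earlier definition so the chunk can be re-run without error)
capture program drop treatment_dummies
program define treatment_dummies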
gen any_CCE = .
replace any_CCE = 0 if inlist(treatment, 0, 2)
replace any_CCE = 1 if inlist(treatment, 1, 3)
gen any_TARL = .
replace any_TARL = 0 if inlist(treatment, 0, 1)
replace any_TARL = 1 if inlist(treatment, 2, 3)
gen CCE = .
replace CCE = 1 if treatment == 1
replace CCE = 0 if inlist(treatment, 0, 2, 3)
gen TARL = .
replace TARL = 1 if treatment == 2
replace TARL = 0 if inlist(treatment, 0, 1, 3)
gen CCE_TARL = .
replace CCE_TARL = 1 if treatment == 3
replace CCE_TARL = 0 if inlist(treatment, 0, 1, 2)
end
* Run the program & check your results
treatment_dummies
foreach i in any_CCE any_TARL CCE TARL CCE_TARL {
tab `i', m
}
Baseline balance refers to whether the treatment and comparison groups are similar at the start of the evaluation, before the program begins.
When we randomly assign units (such as individuals, households, or schools) to treatment and comparison groups, the goal is to create groups that are statistically similar on average—that is, balanced in terms of characteristics that could influence the outcome the program aims to improve (in this case, learning outcomes).
Even with proper randomization, small differences between groups can occur due to chance. This is expected. However, large or systematic differences may suggest problems with the randomization process or the implementation of the assignment.
So, how do we check for baseline balance? Researchers typically test for balance by comparing the means of key baseline variables across treatment groups. This is often done using regression analysis or t-tests to see whether any differences are statistically significant.
Instructions: Estimate treatment and control differences in the key baseline variables included in the codebook above by running a regression of each baseline variable on a dummy for each treatment arm, controlling for the stratification used in the randomization. Print the coefficients and standard errors in a summary table.
There are multiple ways to code up your balance table. You can start by including all of your baseline variables in a `global` called `outcomelist_base`.
* Define list of baseline variables
global outcomelist_base "female age_years base_standard base_aser_read_norm base_aser_math_norm base_h_score_norm base_m_score_norm"
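We’ll then run a loop that goes through each variable in `outcomelist_base` and does the following:
- Counts the non-missing observations of `var`
- Regresses `var` on the three treatment indicators: `CCE`, `TARL`, and `CCE_TARL`
- Calculates the mean and standard deviation of `var` in the comparison group
- Appends the results to a matrix called `balance_table`
You must run the chunk below all at once!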
* Run loop for each baseline outcome
foreach var in $outcomelist_base {
* Get number of non-missing observations
qui count if `var'<.
local nobs=r(N)
* Run balance regression
areg `var' CCE TARL CCE_TARL, cluster(super_school_id) absorb(stratum)
* Get control group mean and SD (no CCE, no TARL)
qui sum `var' if e(sample)==1 & any_CCE==0 & any_TARL==0
local control_mean=r(mean)
local control_sd=r(sd)
* Append results to matrix
mat balance_table = nullmat(balance_table) \ ///
`control_mean', _b[CCE], _b[TARL], _b[CCE_TARL] \ ///
`control_sd', _se[CCE], _se[TARL], _se[CCE_TARL]
}
After that, we can label our matrix and print it to the Stata console.
* Label matrix
mat balance_table=balance_table[.,1..4]
mat colnames balance_table=control CCE_only TARL_only CCE_TARL
local rowlabels
foreach var in $outcomelist_base {
local rowlabels `rowlabels' `var'_mean `var'_sd
}
matrix rownames balance_table = `rowlabels'
* Print balance table
matrix list balance_table
The resulting balance table should be similar to the first four columns in Table 1 of Berry et al. (2018). The first column shows the mean and standard deviation for each baseline variable in the comparison group. Columns 2 through 4 report the estimated difference in means between the comparison group and CCE, TARL, and CCE combined with TARL, after controlling for stratification and clustering at the school campus level.
For example, in the row for `age_years`, students in the comparison group are on average 9.03 years old. The coefficient for the treatment group receiving CCE only is 0.070, meaning students in the CCE group are on average 0.07 years older than those in the comparison group. But this difference is small relative to the standard error of 0.058 and not statistically significant, suggesting this variable is balanced across groups.
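To make the comparison concrete, the implied t-statistic for this coefficient can be worked out from the numbers quoted above. The 1.96 cutoff assumes a two-sided test at the 5% level with a large-sample normal approximation, which is an assumption of this sketch rather than something stated in the paper.

```latex
t = \frac{\hat{\beta}_{\text{CCE}}}{\text{SE}(\hat{\beta}_{\text{CCE}})} = \frac{0.070}{0.058} \approx 1.21 < 1.96
```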
In this section of the exercise, you will conduct parametric power calculations for the study design discussed above based on parameters estimated from the baseline data of the study. While the real study evaluated the impact of CCE and TARL on four learning outcomes, for simplicity we’ll focus only on oral Hindi test scores (`base_aser_read_norm`).
Keep in mind that in a real study, ex-ante power calculations should be conducted for each planned outcome of interest. If you need a refresher on the components of power calculations, please head to J-PAL’s research resource on power calculations.
In this section, we’ll use the baseline variables from the `primary_cleaned.dta` dataset to get parameters for the comparison group and the first treatment group for CCE. We’ll need to estimate the average number of students within school campuses, the total number of school campuses, and the outcome variance and intracluster correlation (ICC) of our main outcome of interest, oral Hindi test scores, for the two groups.
Note that in a real study, we would not have access to our main outcome of interest at endline until we run the study, so we need to use the baseline oral Hindi test scores as a proxy.
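Instructions: Using the `treatment` variable you created earlier, estimate and store the following parameters that you’ll use for your power calculations: the number of school campuses and the average number of students per school campus in the comparison and CCE groups, and the residual standard deviation and ICC of oral Hindi test scores.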
Include your main outcome variable in a `global` named `outcome` and the remaining baseline test scores that you’ll use as covariates in a `global` named `baseline_tests`.
global outcome "base_aser_read_norm" // Oral Hindi scores at baseline
global baseline_tests "base_aser_math_norm base_h_score_norm base_m_score_norm"
We’ll start by storing estimates for our sample size. Since the study used a clustered design, we’ll want both the total number of clusters in each group and the average cluster size (i.e., the average number of students in each school campus). Use the `distinct` command we installed earlier to count the total number of school campuses in the comparison group and the CCE treatment group.
count if treatment == 0 // Comparison group
global students_c = r(N)
count if treatment == 1 // CCE treatment group
global students_t = r(N)
distinct super_school_id if treatment == 0 // Comparison group
global clusters_c = r(ndistinct)
distinct super_school_id if treatment == 1 // CCE treatment group
global clusters_t = r(ndistinct)
global cluster_size_c = floor($students_c / $clusters_c)
global cluster_size_t = floor($students_t / $clusters_t)
display "The average number of students in comparison school campuses is $cluster_size_c."
display "The average number of students in the school campuses assigned to the CCE treatment group is $cluster_size_t."
Our next step will be to get the residual standard deviation of our outcome variable, `base_aser_read_norm`, by accounting for covariates and strata. Remember that the residual standard deviation measures the amount of variation in the outcome variable that remains unexplained after accounting for the covariates included in the regression model. In the context of power calculations, calculating the residual variance of our outcome can provide us with an estimate of the “noise” in the outcome variable after adjusting for known factors.
To do this, we run a regression of the outcome on baseline covariates (with stratum fixed effects and clustered standard errors), predict the residuals from this model, and calculate the standard deviation of those residuals.
Note that here we use `areg` instead of `reg` to avoid printing the coefficients of each stratum.
cap drop res
areg $outcome female age_years base_standard $baseline_tests, absorb(stratum) vce(cluster super_school_id)
predict res, res
sum res
global res_sd = r(sd)
display "The residual standard deviation is $res_sd."
Finally, we’ll calculate the ICC for our outcome. The `loneway` command runs a one-way ANOVA to decompose the variance into between-cluster and within-cluster components. The ICC, stored in the global macro `$icc`, measures the proportion of total variance that is due to differences between clusters. A higher ICC indicates that outcomes within clusters are more similar to each other, and will typically result in lower power, all else equal.
Additional documentation for the `loneway` command can be found here.
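A minimal sketch of this step, assuming the ICC is computed on the full sample with school campuses as clusters (the original chunk may restrict the sample differently):

```stata
* Estimate the ICC of the outcome across school campuses
loneway $outcome super_school_id
global icc = r(rho)   // loneway stores the intraclass correlation in r(rho)
display "The ICC is $icc."
```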
In this section, we’ll conduct power calculations for estimating the minimum detectable effect (MDE) of the study using built-in commands from Stata and R and the parameters we estimated in section C.1. Remember that for a randomized design with more than two groups, you should conduct power calculations for every comparison, especially if you expect to have unequal sized groups.
In this study, the authors made three main comparisons at the end of the programs (for each outcome variable):
For simplicity, we’ll stick to one comparison in this section: test scores in the comparison group compared to test scores in the CCE group. Note that since our groups are roughly equal in size, we can expect that the power for each comparison will be roughly similar.
Instructions: Using the baseline average oral Hindi test scores and the parameters estimated in section C.1, estimate the MDE you’d be able to detect with your current randomization design. Set power to .80 and alpha (the significance level) to .05.
First, start by estimating and storing average oral Hindi test scores at baseline. Since the authors standardized this variable, it should be very close to 0.
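One way to do this is sketched below. It stores the mean in a global named `baseline`, the name that the `power twomeans` call in this section refers to as `$baseline`, and computes it on the full sample, which is an assumption.

```stata
* Store the baseline mean of the outcome
summarize $outcome
global baseline = r(mean)
display "The baseline mean of oral Hindi test scores is $baseline."
```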
Use the `power twomeans` command to estimate the MDE of your study, using the `global` macros you stored in section C.1. The command takes the following parameters:
- `k1` and `k2` are the number of clusters in the comparison and treatment groups, respectively
- `m1` and `m2` are the average number of units in each cluster in the comparison and treatment groups, respectively
- `rho` is the ICC
- `sd` is the standard deviation you want to use
- `alpha` is the significance level
- `power` is the power level
Run `help power twomeans` in your Stata console for additional documentation on the command.
power twomeans $baseline, k1($clusters_c) k2($clusters_t) m1($cluster_size_c) ///
m2($cluster_size_t) alpha(.05) power(.8) rho($icc) sd($res_sd)
The `power` command also has an option to account for the variation of cluster sizes, `cvcluster`.
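For intuition about what the command is solving for, a commonly used closed-form approximation for the MDE in a clustered design with equal cluster sizes is sketched below. The notation is introduced here for illustration and does not come from the exercise: \(J\) is the total number of clusters, \(m\) the number of units per cluster, \(P\) the share of clusters assigned to treatment, \(\sigma\) the (residual) standard deviation, \(\rho\) the ICC, \(\kappa\) the power, and \(\alpha\) the significance level. Stata’s exact computation may differ from this approximation.

```latex
\text{MDE} = \left(t_{1-\kappa} + t_{\alpha/2}\right) \sqrt{\frac{1}{P(1-P)\,J}} \; \sqrt{\rho + \frac{1-\rho}{m}} \; \sigma
```

Under this approximation, a higher ICC or a smaller number of clusters raises the MDE, while a lower residual standard deviation reduces it.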
In this section of the exercise, you will explore how different experimental design choices affect randomization, balance, and power, using the same baseline dataset from the previous sections. You will also practice estimating the MDE for binary outcome variables, as well as solve for sample size instead of the MDE.
Throughout this section, we’ll use a power level of 0.80 and a significance level of 0.05, unless otherwise specified.
Suppose that the researchers find that students, teachers, and parents in different school campuses within a village interact regularly (for example, through community meetings, shared events, or informal networks). This makes them think that randomizing at the village level could help reduce spillovers that might occur from treated school campuses to comparison school campuses within the same village.
Instructions: Complete the following exercises:
- Using the `village_id` variable, randomize four groups at the village level, without stratifying. Set your seed to `12345`.
Note: For this exercise, use the baseline standard deviation (without adjusting for covariates); do not calculate a residual standard deviation from a regression.
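A possible starting point for the randomization step is sketched below. It mirrors the campus-level chunk from earlier in the exercise, assumes the `$data` and `$temp` globals are set, and handles misfits globally, which is an assumption since the instructions do not specify how to treat them. The file name `village_treatment.dta` is hypothetical.

```stata
* Village-level random assignment without stratification
use "$data/primary_cleaned.dta", clear
preserve
    keep village_id
    duplicates drop
    sort village_id
    randtreat, generate(treatment) multiple(4) misfits(global) setseed(12345)
    save "$temp/village_treatment.dta", replace
restore

* Merge the village-level assignment back into the student-level data
merge m:1 village_id using "$temp/village_treatment.dta", nogenerate
tab treatment, m
```

Because the unit of randomization changes, the cluster counts, cluster sizes, and ICC used in the power calculation would need to be re-estimated with `village_id` as the cluster variable.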
Variable | Coefficient | Std. Err. | P-Value |
---|---|---|---|
CCE | -0.144 | 0.085 | 0.097 |
TARL | -0.114 | 0.109 | 0.296 |
CCE and TARL | -0.121 | 0.115 | 0.296 |
Constant | 0.062 | 0.067 | 0.358 |
Suppose that after considering the potential for spillovers within villages, the researchers also recognize a trade-off: randomizing at the village level results in a smaller number of clusters, which reduces statistical power. Moreover, further investigation shows that interactions between schools in the same campus are less frequent or less influential than initially thought — in fact, schools operate quite independently, and teachers and parents rarely share information across schools.
Given this, the researchers decide to randomize at the school level.
Instructions: Complete the following exercises:
- Using the `school_id` variable, randomize four groups at the school level, without stratifying. Set your seed to `12345`.
Note: For this exercise, use the baseline standard deviation (without adjusting for covariates); do not calculate a residual standard deviation from a regression.
Variable | Coefficient | Std. Err. | P-Value |
---|---|---|---|
CCE | 0.061 | 0.068 | 0.372 |
TARL | -0.033 | 0.074 | 0.656 |
CCE and TARL | 0.020 | 0.075 | 0.795 |
Constant | -0.048 | 0.052 | 0.353 |
After considering the costs and logistics of program implementation, the researchers find out that providing the CCE and TARL programs to all eligible schools would be too expensive. To meet their budget, they decide to explore how using an unequal allocation ratio (with more schools in the comparison group) affects the study’s statistical power.
Instructions: Complete the following exercises:
- Using the `school_id` variable, randomize four groups at the school level, without stratifying. Set the comparison group to have 50% of the sample, and each treatment group to have 16.7%. Set your seed to `12345` and deal with misfits globally.
Note: For this exercise, use the baseline standard deviation (without adjusting for covariates); do not calculate a residual standard deviation from a regression.
The `randtreat` command in Stata has an option for unequal allocation of treatment assignments, `unequal()`. For example, it can be used to randomize four groups, with the comparison group having 20% of the sample, the first treatment group having 40% of the sample, and the remaining two treatment groups having 20% of the sample each.
Run `help randtreat` in your Stata console for additional documentation.
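A sketch of such a call is shown below, assuming a dataset with one row per school and no existing `treatment` variable. In `randtreat`, `unequal()` takes fractions that sum to one, listed in the order of the generated treatment values (0, 1, 2, 3), and takes the place of `multiple()`.

```stata
* Unequal allocation: 20% comparison, 40% first treatment, 20% each for the rest
randtreat, generate(treatment) unequal(1/5 2/5 1/5 1/5) ///
    misfits(global) setseed(12345)
tab treatment, m
```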
Variable | Coefficient | Std. Err. | P-Value |
---|---|---|---|
CCE | -0.060 | 0.075 | 0.427 |
TARL | -0.086 | 0.074 | 0.243 |
CCE and TARL | -0.166 | 0.061 | 0.007 |
Constant | 0.017 | 0.037 | 0.649 |
Including covariates in power calculations can improve statistical power by accounting for variation in the outcome of interest that is unrelated to the treatment. By explaining some of this variation, covariates reduce the residual variance, allowing for more precise estimates of the treatment effect.
Adjusting for covariates can be especially useful when:
Instructions: Using the same randomization as the previous section, estimate the MDE of the CCE program on oral Hindi test scores adjusting for the following covariates: gender, grade, age (in years), and baseline oral math test scores, written math test scores, and written Hindi test scores.
So far, you’ve worked with continuous outcomes, where variation is described by the standard deviation and you’ve used this to inform your power calculations. In contrast, binary outcomes take on only two values (e.g., 0/1 for “did not achieve” vs. “achieved”). Instead of using the standard deviation, power calculations for binary outcomes rely on the proportion of individuals in the comparison group with the outcome of interest. If you were doing power calculations manually with a binary variable \(Y\), which takes values 0 or 1 and has mean \(P\), the variance of \(Y\) is given by \(\text{Var}(Y) = P(1 - P)\).
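As a quick worked case, consider a binary outcome defined by a median split, so that roughly half of the observations equal 1. The value \(P = 0.5\) is an illustrative assumption; the actual share depends on ties at the median.

```latex
\text{Var}(Y) = P(1-P) = 0.5 \times 0.5 = 0.25, \qquad \text{SD}(Y) = \sqrt{0.25} = 0.5
```

This is also the largest variance a binary outcome can have, since \(P(1-P)\) is maximized at \(P = 0.5\).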
Instructions: Using the same randomization as the previous section, create binary variables for each of the four learning outcomes based on their median (e.g., an indicator equal to 1 if a student’s oral Hindi test score is greater than the median oral Hindi test score). Then, estimate the MDE you’d be able to detect due to the CCE program for the proportion of students with test scores above the median. Don’t account for covariates.
You can create the binary versions of the learning outcome
variables using the loop code chunk below:
foreach var in base_aser_math_norm base_aser_read_norm base_m_score_norm base_h_score_norm {
cap drop median_`var'
cap drop `var'_b
egen median_`var' = median(`var')
gen `var'_b = (`var' > median_`var') if !missing(`var') // keep missing scores missing
}
The Stata `power` command has a method for working with binary variables, `power twoproportions`. It works very similarly to the `power twomeans` command; however, Stata computes the variance of your outcome automatically. Run `help power twoproportions` in your Stata console for further documentation.
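A hedged sketch of the call for the binary oral Hindi outcome follows. It assumes the cluster globals from section C.1 have been re-estimated for your current randomization, and it uses `super_school_id` as the cluster variable for illustration; swap in the unit you randomized on. `p_c` and `icc_b` are hypothetical global names introduced here for the comparison-group proportion and the ICC of the binary outcome.

```stata
* Proportion above the median in the comparison group
summarize base_aser_read_norm_b if treatment == 0
global p_c = r(mean)

* ICC of the binary outcome across clusters
loneway base_aser_read_norm_b super_school_id
global icc_b = r(rho)

* MDE for a difference in proportions in a clustered design
power twoproportions $p_c, k1($clusters_c) k2($clusters_t) ///
    m1($cluster_size_c) m2($cluster_size_t) rho($icc_b) alpha(.05) power(.8)
```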
Stratification and subgroup analysis are important tools for improving study design and for answering specific research questions about how impacts might vary across different groups. Stratification refers to randomization done within subgroups (such as regions, blocks, or baseline characteristics) to help ensure balance between treatment and control groups on key variables. This can improve the precision of your estimates and reduce the likelihood of baseline imbalances, especially in smaller samples or clustered designs.
Subgroup analysis involves examining treatment effects separately for different groups (e.g., by grade level, gender, or baseline ability). This helps assess whether the program works differently across these groups, but it typically comes at the cost of lower statistical power because you are working with smaller samples.
In this section, you will explore how stratification and subgroup analysis affect power and precision, and how these choices interact with your study design.
Instructions: Using the continuous oral Hindi test scores again, complete the following exercises:
- Set your seed to `12345`. Don’t stratify your randomization.
Suppose you wanted to understand whether the impact of the CCE program differs by gender. Since randomization was done at the school campus level, and there are both girls and boys in each school, we can’t stratify our randomization by gender. How would you conduct your power calculations to ensure you have enough power to detect a difference in the program impact by gender?
Instructions: Estimate the MDE for the CCE program for your subgroup analysis of girls’ oral Hindi test scores.
Suppose that you’d like to understand whether the CCE program is more or less effective for students in schools that score lower on average.
Instructions: Complete the following exercises:
- Create a variable `q_read_campus` from `base_aser_read_norm` that splits school campuses into one of four quartiles based on their average baseline oral Hindi test scores.
- Set your seed to `12345`. Deal with misfits globally.
You can set up your quartile variable for school campuses using
the code block below.
use "$data/primary_cleaned.dta", clear
matrix drop _all
* Create quartiles of baseline average oral Hindi test scores at the school campus level
collapse (mean) base_aser_read_norm, by(super_school_id)
* Create quartiles at the school campus level
xtile q_read_campus = base_aser_read_norm, nq(4)
* Label the quartiles for clarity
cap label define q_read_campus 1 "Q1: Lowest" 2 "Q2" 3 "Q3" 4 "Q4: Highest"
label values q_read_campus q_read_campus
* Save the campus-level quartile stratification variable
save "$temp/campus_quartiles.dta", replace
* Merge this back into the student-level dataset
use "$data/primary_cleaned.dta", clear
merge m:1 super_school_id using "$temp/campus_quartiles.dta"
drop _merge
* Check the distribution
tab q_read_campus, m
* Label the new variable
cap label variable q_read_campus "Stratum based on baseline Oral Hindi (school campus avg)"
Instructions: Now, complete the following exercises:
- Stratify your randomization on the `q_read_campus` variable you created in the previous section, and set your seed to `12345`. Deal with misfits locally.