A. Introduction

This exercise is designed to give you a practical application of randomization and power calculations in Stata and R. In this exercise, you’ll replicate the random assignment process for a randomized evaluation of two education programs in India: Continuous and Comprehensive Evaluation (CCE) and Teaching at the Right Level (TaRL). You’ll implement clustered and stratified random assignment, assess baseline balance, and calculate statistical power. The goal of the exercise is to familiarize yourself with how to conduct randomization and power calculations for complex experimental designs.

📚 Berry, James, Priya Mukherji, Shobhini Mukherji, and Marc Shotland. (2018). Failure of Frequent Assessment: An Evaluation of India’s Continuous and Comprehensive Evaluation Program. J-PAL South Asia.

To get started, make a copy of this Dropbox folder, containing the original data from the study as well as the exercise coding files, on your local machine. If you are working in Stata, open the do-file RST_rand_power.do. If you are working in R, open the R script RST_rand_power.R.

Instructions: Install the necessary packages you’ll be working with and set your working directory at the top of the file.

Guidance for Stata users


The following code chunk can be used at the beginning of your file to set up your environment. clear all removes all data, saved results, and stored programs from memory; set more off prevents Stata from pausing output; and version 16 ensures your code runs under Stata 16’s syntax and behavior (note that your installed version of Stata must therefore be 16 or newer).

clear all
set more off
version 16

Below, we’ll install two packages that we’ll leverage later in the exercise. distinct is used to count the number of unique values in a variable, and randtreat is used to generate randomized treatment assignment. You can find additional documentation for the randtreat command here.

* Install required packages
ssc install distinct
net install randtreat, from("https://raw.github.com/acarril/randtreat/master/") replace

Finally, set your working directories to the folders you’ll be using with Stata’s global macros. Add the location of the main folder you downloaded between the quotation marks after global main.

* Set working directories 
global main "SET MAIN WORKING DIRECTORY HERE"
    global data "$main/dta"
    global temp "$main/temp"
Guidance for R users

A.1 Study context

The study was conducted across 500 lower and upper primary government schools in two districts of Haryana, India, during the 2012–13 academic year. The intervention focused on improving foundational learning by testing two programs—CCE, a policy emphasizing regular assessments and student tracking, and TaRL, an instructional approach that tailors teaching to children’s learning levels. The randomized evaluation of the program was based on a factorial design, where schools were randomly assigned in equal proportions to one of four groups.

Treatment Group   Description
T1                Comparison group: schools within a school campus assigned to this group follow the standard government curriculum.
T2                CCE only: schools within a school campus assigned to this group receive CCE only.
T3                TARL only: schools within a school campus assigned to this group receive TARL only.
T4                CCE + TARL: schools within a school campus assigned to this group receive both TARL and CCE.


A.2 Data description

The authors of the study collected data on student learning outcomes at baseline and endline, as well as teacher and school characteristics. While the real study conducted two experiments in lower and upper primary schools, this exercise will only use the lower primary school data, labeled as primary_cleaned.dta. The RCT evaluated the impact of the programs on four learning outcomes: oral Hindi, oral math, written Hindi, and written math.

Toggle the codebook section below to see some of the main variables we’ll be using in the exercise.

🔍 Codebook
Data                                            Variable in primary_cleaned.dta
School campus                                   super_school_id
Original stratum                                stratum
Female                                          female
Age (years)                                     age_years
Grade in 2011–12 school year                    base_standard
Oral Hindi score at baseline (standardized)     base_aser_read_norm
Oral math score at baseline (standardized)      base_aser_math_norm
Written Hindi score at baseline (standardized)  base_h_score_norm
Written math score at baseline (standardized)   base_m_score_norm

B. Randomization Coding Walkthrough

B.1 Random assignment

In this section of the exercise, you will replicate the randomization of the original study. Note that even though you are conducting the same randomization process, you will not get exactly the same results as the paper, since randomization is, you guessed it, random!

The goal is to randomly assign each school campus to one of the four treatment groups, clustering at the school campus level (all schools in a campus receive the same assignment) and stratifying by the original stratum variable created by the authors. This can be done manually, or by using packages like randomizr in R or randtreat in Stata.

💡 How were the strata created? The stratum variable was constructed by first grouping schools by block and school type (lower, upper, or both). Within each group, school campuses were sorted by average baseline test scores and grouped into strata of four school campuses each to enable stratified random assignment across four treatment arms.
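To make this concrete, here is a minimal sketch of that construction in Stata. Note that block and school_type are hypothetical variable names used for illustration; they are not in the exercise dataset.

* Hypothetical sketch of the stratum construction (block and school_type
* are assumed names, not variables in primary_cleaned.dta)
preserve
collapse (mean) avg_read = base_aser_read_norm, by(super_school_id block school_type)
sort block school_type avg_read
by block school_type: gen strat_seq = ceil(_n / 4)      // groups of 4 campuses
egen stratum_demo = group(block school_type strat_seq)  // unique stratum IDs
restore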
💡 What are misfits and how do I deal with them?

Misfits happen when, during the randomization process, the number of units in a group (like a stratum) doesn’t divide evenly across the treatment groups. For example, if you have 5 units and want to assign them evenly to two groups (treatment and comparison), one unit will be left over—this is a misfit.

In simple cases, such as random assignment to two groups without stratification, you can just randomly assign the leftover unit after dividing up the rest. But when working with multiple strata, you need to decide whether to balance treatment assignments within each stratum (local allocation) or across the whole sample (global allocation). With local allocation, misfits are assigned independently within each stratum, which preserves balance within strata but can leave overall group sizes slightly uneven. With global allocation, misfits from all strata are pooled and assigned together, which keeps the overall allocation ratio as close as possible to the target but can leave individual strata slightly unbalanced.

Check out this paper by Carril (2017) for more details on dealing with misfits.
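To see the difference in practice, the sketch below generates one assignment with each misfit option, assuming the data have already been collapsed to unique school campuses with their stratum (as in the walkthrough below):

* Compare local vs. global misfit handling on campus-level data
randtreat, generate(t_local) strata(stratum) misfits(strata) multiple(4) setseed(12345)
randtreat, generate(t_global) strata(stratum) misfits(global) multiple(4) setseed(12345)

* Within-stratum group sizes will be tighter under misfits(strata);
* overall group sizes will be tighter under misfits(global)
tab t_local
tab t_global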

Instructions: Randomize assignment of school campuses to four groups of equal proportion, stratifying on the stratum variable and handling misfits locally. Set your seed to 12345.

Guidance for Stata users


The code below imports the lower primary school data from the study, referencing the global for the data folder you set above with a $.

* Import the data
use "$data/primary_cleaned.dta", clear

The chunk below uses Stata’s preserve and restore commands to conduct the random assignment without altering the original dataset. The code chunk should be run all together, rather than line by line. Since the data is at the student level and we’ll be conducting randomization at the school campus level, we keep only unique school campuses.

We then randomize treatment using the randtreat command we installed earlier:

  • generate : creates the variable treatment, which contains the treatment assignment for each school campus
  • strata : stratifies the randomization by the stratum variable
  • misfits : determines how misfits are allocated; misfits(strata) handles them locally, within each stratum
  • multiple : sets the number of equal-sized groups
  • setseed : specifies a seed for the randomization. This is important so that anybody who runs your code can exactly replicate your randomization!

We’ll then save the results of the randomization in the temp folder we specified earlier.

* Random assignment
preserve
keep super_school_id stratum
duplicates drop  
sort super_school_id

randtreat, generate(treatment) strata(stratum) ///
    misfits(strata) multiple(4) setseed(12345) 
    
save "$temp/school_campus_treatment.dta", replace
restore

Below, we’ll merge the random assignment of school campuses back into our original dataset using super_school_id, and label the comparison and treatment groups accordingly. Check out your treatment variable using the tab command.

* Merge original data with the generated random assignment
merge m:1 super_school_id using "$temp/school_campus_treatment.dta"
drop _merge

* Label the treatment groups
label define trt 0 "Control" 1 "CCE" 2 "TARL" 3 "CCE+TARL"
label values treatment trt

tab treatment, m

We’ll then generate two binary dummies for whether students are assigned to a group with any CCE program (i.e., treatment groups 1 and 3) or any TARL program (i.e., treatment groups 2 and 3), as well as binary dummies for each treatment group. These will be used in the next section for our summary table.

Since you will need to do this a few times throughout the exercise, we’ll wrap our code in a very simple program called treatment_dummies. You can call the program later on by running treatment_dummies.

* Create program for generating binary dummies for treatment status 
program define treatment_dummies
    gen any_CCE = . 
    replace any_CCE = 0 if inlist(treatment, 0, 2)
    replace any_CCE = 1 if inlist(treatment, 1, 3)

    gen any_TARL = . 
    replace any_TARL = 0 if inlist(treatment, 0, 1)
    replace any_TARL = 1 if inlist(treatment, 2, 3)

    gen CCE = . 
    replace CCE = 1 if treatment == 1 
    replace CCE = 0 if inlist(treatment, 0, 2, 3) 

    gen TARL = . 
    replace TARL = 1 if treatment == 2 
    replace TARL = 0 if inlist(treatment, 0, 1, 3) 

    gen CCE_TARL = . 
    replace CCE_TARL = 1 if treatment == 3 
    replace CCE_TARL = 0 if inlist(treatment, 0, 1, 2) 
end

* Run the program & check your results
treatment_dummies 
foreach i in any_CCE any_TARL CCE TARL CCE_TARL {
    tab `i', m
}
Guidance for R users

B.2 Baseline balance

Baseline balance refers to whether the treatment and comparison groups are similar at the start of the evaluation, before the program begins.

When we randomly assign units (such as individuals, households, or schools) to treatment and comparison groups, the goal is to create groups that are statistically similar on average—that is, balanced in terms of characteristics that could influence the outcome the program aims to improve (in this case, learning outcomes).

Even with proper randomization, small differences between groups can occur due to chance. This is expected. However, large or systematic differences may suggest problems with the randomization process or the implementation of the assignment.

So, how do we check for baseline balance? Researchers typically test for balance by comparing the means of key baseline variables across treatment groups. This is often done using regression analysis or t-tests to see whether any differences are statistically significant.
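For example, a quick check for a single baseline variable might look like the sketch below, which uses the treatment dummies created in section B.1 (the walkthrough in this section builds the full balance table):

* Regression-based balance check for one baseline variable
areg age_years CCE TARL CCE_TARL, absorb(stratum) cluster(super_school_id)

* Or a simple t-test for any CCE, ignoring strata and clustering
ttest age_years, by(any_CCE)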

💡 Do we always need to conduct balance tests? Not necessarily. When randomization is implemented correctly, treatment and comparison groups are balanced in expectation, so some researchers argue that formal balance tests add little. Check out this blog by David McKenzie on when testing for baseline balance might make sense.

Instructions: Estimate treatment and control differences in the key baseline variables included in the codebook above by running a regression of each baseline variable on a dummy for each treatment arm, controlling for the stratification used in the randomization. Print the coefficients and standard errors in a summary table.

Guidance for Stata users


There are multiple ways to code up your balance table. You can start with including all of your baseline variables in a global called outcomelist_base.

* Define list of baseline variables
global outcomelist_base "female age_years base_standard base_aser_read_norm base_aser_math_norm base_h_score_norm base_m_score_norm"

We’ll then run a loop that goes through each variable in outcomelist_base and does the following:

  • Gets the number of non-missing observations for each variable var
  • Runs a regression of each var on the three treatment indicators: CCE, TARL, and CCE_TARL
  • Gets the mean and standard deviation of each var in the comparison group
  • Appends all of the above to a matrix named balance_table

You must run the chunk below all at once!

* Run loop for each baseline outcome
foreach var in $outcomelist_base {
        
        * Get number of non-missing observations
        qui count if `var'<. 
        local nobs=r(N)
        
        * Run balance regression
        areg `var' CCE TARL CCE_TARL, cluster(super_school_id) absorb(stratum)  
        
        * Get control group mean and SD (no CCE, no TARL)
        qui sum `var' if e(sample)==1 & any_CCE==0 & any_TARL==0                
        local control_mean=r(mean)
        local control_sd=r(sd)
        
        * Append results to matrix
        mat balance_table = nullmat(balance_table) \ ///
            `control_mean', _b[CCE], _b[TARL], _b[CCE_TARL] \ ///
            `control_sd', _se[CCE], _se[TARL], _se[CCE_TARL]
}

After that, we can label our matrix and print it to the Stata console.

* Label matrix 
mat balance_table=balance_table[.,1..4]
mat colnames balance_table=control CCE_only TARL_only CCE_TARL

local rowlabels
foreach var in $outcomelist_base {
    local rowlabels `rowlabels' `var'_mean `var'_sd
}

matrix rownames balance_table = `rowlabels'

* Print balance table
matrix list balance_table
Guidance for R users
Code goes here

The resulting balance table should be similar to the first four columns in Table 1 of Berry et al. (2018). The first column shows the mean and standard deviation for each baseline variable in the comparison group. Columns 2 through 4 report the estimated difference in means between the comparison group and CCE, TARL, and CCE combined with TARL, after controlling for stratification and clustering at the school campus level.

For example, in the row for age_years, students in the comparison group are on average 9.03 years old. The coefficient for the treatment group receiving CCE only is 0.070, meaning students in the CCE group are on average 0.07 years older than those in the comparison group. But this difference is small relative to the standard error of 0.058, and not statistically significant, suggesting this variable is not imbalanced across groups.

💡 How do I interpret the standard errors? A standard error measures how precisely a regression coefficient is estimated. It reflects the amount of variation we would expect in the estimate if we repeated the study many times. You can check whether a coefficient is statistically significant using a t-statistic: \[ t = \frac{\text{coefficient}}{\text{standard error}} \] If \(|t| > 1.96\) (in a two-sided test), the coefficient is significant at the 5% level. In the example of mean age (in years) above, the estimated difference in age between the CCE and comparison groups is 0.07 years with a standard error of 0.058. The resulting t-statistic is approximately 0.07 / 0.058 \(\approx\) 1.21, which is less than 1.96, so this difference is not statistically significant.
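You can compute this t-statistic directly from Stata’s stored regression results; for example, after one of the balance regressions above:

* t-statistic for the CCE coefficient from the most recent regression
display "t = " _b[CCE] / _se[CCE]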

B.3 Solutions to imbalance

C. Power Calculations Coding Walkthrough

In this section of the exercise, you will conduct parametric power calculations for the study design discussed above based on parameters estimated from the baseline data of the study. While the real study evaluated the impact of CCE and TARL on four learning outcomes, for simplicity we’ll focus only on oral Hindi test scores (base_aser_read_norm).

Keep in mind that in a real study, ex-ante power calculations should be conducted for each planned outcome of interest. If you need a refresher on the components of power calculations, please head to J-PAL’s research resource on power calculations.

💡 What is the difference between parametric and non-parametric power estimation? Parametric power calculations use closed-form formulas that rely on assumptions about the data (for example, normally distributed outcomes with a known variance), while non-parametric approaches estimate power by simulating the randomization and analysis many times on real or synthetic data. Section E of this exercise covers power estimation through simulation.

C.1 Estimate parameters from baseline data

In this section, we’ll use the baseline variables from the primary_cleaned.dta dataset to get parameters for the comparison group and the first treatment group for CCE. We’ll need to estimate the average number of students within school campuses, the total number of school campuses, and the outcome variance and intracluster correlation (ICC) of our main outcome of interest, oral Hindi test scores, for the two groups.

Note that in a real study, we would not have access to our main outcome of interest at endline until we run the study, so we need to use the baseline oral Hindi test scores as a proxy.

Instructions: Using the treatment variable you created earlier, estimate and store the following parameters that you’ll use for your power calculations:

  • the average number of students within a school campus for the comparison group and the CCE group
  • the total number of clusters assigned to the comparison group and the CCE group
  • the residual standard deviation of oral Hindi test scores (adjusting for strata, clustering, and baseline covariates, including gender, age in years, grade, and baseline test scores)
  • the ICC of oral Hindi test scores
Guidance for Stata users


Include your main outcome variable in a global named outcome and the remaining baseline test scores that you’ll use as covariates in a global named baseline_tests.

global outcome "base_aser_read_norm"            // Oral Hindi scores at baseline
global baseline_tests "base_aser_math_norm base_h_score_norm base_m_score_norm" 

We’ll first start by storing estimates for our sample size. Since the study used a clustered design, we’ll want both the total number of clusters in each group and the average cluster size (i.e., the average number of students in each school campus). Use count to store the number of students, and the distinct command we installed earlier to count the total number of school campuses in the comparison group and the CCE treatment group.

count if treatment == 0     // Comparison group
global students_c = r(N)
count if treatment == 1     // CCE treatment group
global students_t = r(N)

distinct super_school_id if treatment == 0      // Comparison group
global clusters_c = r(ndistinct)    
distinct super_school_id if treatment == 1      // CCE treatment group
global clusters_t = r(ndistinct)    

global cluster_size_c = floor($students_c / $clusters_c)
global cluster_size_t = floor($students_t / $clusters_t)

display "The average number of students in the school campuses assigned to the CCE treatment group is $cluster_size_c."
display "The average number of students in comparison school campuses is $cluster_size_t."

Our next step will be to get the residual standard deviation of our outcome variable, base_aser_read_norm, by accounting for covariates and strata. Remember that the residual standard deviation measures the amount of variation in the outcome variable that remains unexplained after accounting for the covariates included in the regression model. In the context of power calculations, calculating the residual variance of our outcome can provide us with an estimate of the “noise” in the outcome variable after adjusting for known factors.

To do this, we run a regression of the outcome on baseline covariates (with stratum fixed effects and clustered standard errors), predict the residuals from this model, and calculate the standard deviation of those residuals.

Note that here we use areg instead of reg to absorb the stratum fixed effects without printing a coefficient for each stratum.

cap drop res
areg $outcome female age_years base_standard $baseline_tests, absorb(stratum) vce(cluster super_school_id)
predict res, res
sum res  
global res_sd = r(sd)
display "The residual standard deviation is $res_sd."

Finally, we’ll calculate the ICC for our outcome. The loneway command runs a one-way ANOVA to decompose the variance into between-cluster and within-cluster components. The ICC, stored in the global macro $icc, measures the proportion of total variance that is due to differences between clusters. A higher ICC indicates that outcomes within clusters are more similar to each other, and will typically result in lower power, all else equal.

Additional documentation for the loneway command can be found here.

loneway $outcome super_school_id                                                    
global icc = r(rho)
display "The intra-cluster correlation is $icc."
Guidance for R users
Code goes here

C.2 Power calculations

In this section, we’ll conduct power calculations for estimating the minimum detectable effect (MDE) of the study using built-in commands from Stata and R and the parameters we estimated in section C.1. Remember that for a randomized design with more than two groups, you should conduct power calculations for every comparison, especially if you expect to have unequal sized groups.

In this study, the authors made three main comparisons at the end of the programs (for each outcome variable): the comparison group versus CCE only, the comparison group versus TARL only, and the comparison group versus CCE combined with TARL.

For simplicity, we’ll stick to one comparison in this section: test scores in the comparison group compared to test scores in the CCE group. Note that since our groups are roughly equal in size, we can expect that the power for each comparison will be roughly similar.

Instructions: Using the baseline average oral Hindi test scores and the parameters estimated in section C.1, estimate the MDE you’d be able to detect with your current randomization design. Set power to .80 and alpha (the significance level) to .05.

Guidance for Stata users


First, start by estimating and storing average oral Hindi test scores at baseline. Since the authors standardized this variable, it should be very close to 0.

sum $outcome if !missing($outcome)                                      
global baseline = `r(mean)'

Use the power twomeans command to estimate the MDE of your study, using the global macros you stored in section C.1. The command takes the following parameters:

  • k1 and k2 are the number of clusters in the comparison and treatment groups, respectively
  • m1 and m2 are the average number of units in each cluster in the comparison and treatment groups, respectively
  • rho is the ICC
  • sd is the standard deviation you want to use
  • alpha is the significance level
  • power is the power level

Run help power twomeans in your Stata console for additional documentation on the command.

power twomeans $baseline, k1($clusters_c) k2($clusters_t) m1($cluster_size_c) ///
    m2($cluster_size_t) alpha(.05) power(.8) rho($icc) sd($res_sd)

If you have Stata version 18+, the power command also has an option, cvcluster(), to account for variation in cluster sizes.
Guidance for R users
Code goes here

The output reports the smallest difference in means between the comparison and CCE groups (delta) that the study can detect with 80% power at the 5% significance level, given the number of clusters, cluster sizes, ICC, and residual standard deviation. Effects smaller than this minimum detectable effect are likely to go undetected with the current design.

D. Randomization and Power for Different Experimental Designs

In this section of the exercise, you will explore how different experimental design choices affect randomization, balance, and power, using the same baseline dataset from the previous sections. You will also practice estimating the MDE for binary outcome variables, as well as solve for sample size instead of the MDE.

You can check your answers using the blue dropdowns.

Throughout this section, we’ll use a power level of 0.80 and a significance level of 0.05, unless otherwise specified.

D.1 How does the level of randomization affect power?

Suppose that the researchers find that students, teachers, and parents in different school campuses within a village interact regularly (for example, through community meetings, shared events, or informal networks). This makes them think that randomizing at the village level could help reduce spillovers that might occur from treated school campuses to comparison school campuses within the same village.

Instructions: Complete the following exercises:

  • Using the (fictional) village_id variable, randomize four groups at the village level, without stratifying. Set your seed to 12345.
  • Check baseline balance on written math test scores.
  • Estimate the MDE of the CCE program on oral Hindi test scores.

Note: For this exercise, use the baseline standard deviation (without adjusting for covariates) — do not calculate a residual standard deviation from a regression.
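Hints for Stata users

One way to adapt the random assignment code from section B.1 to the village level is sketched below, using the fictional village_id variable:

* Randomize at the village level, without stratification
preserve
keep village_id
duplicates drop
randtreat, generate(treatment) multiple(4) setseed(12345)
save "$temp/village_treatment.dta", replace
restore

* Merge assignments back into the student-level data
cap drop treatment
merge m:1 village_id using "$temp/village_treatment.dta"
drop _merge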

📝 What does balance look like for written math test scores? Are any of the results concerning to you?
Variable       Coefficient   Std. Err.   P-Value
CCE            -0.144        0.085       0.097
TARL           -0.114        0.109       0.296
CCE and TARL   -0.121        0.115       0.296
Constant        0.062        0.067       0.358
📝 What is the smallest effect on oral Hindi test scores due to the CCE program your study would be able to find?
The minimum detectable effect is an increase in test scores of 0.20 standard deviations, or an average of 0.196 standard deviations at endline.

Suppose that after considering the potential for spillovers within villages, the researchers also recognize a trade-off: randomizing at the village level results in a smaller number of clusters, which reduces statistical power. Moreover, further investigation shows that interactions between schools in the same campus are less frequent or less influential than initially thought — in fact, schools operate quite independently, and teachers and parents rarely share information across schools.

Given this, the researchers decide to randomize at the school level.

Instructions: Complete the following exercises:

  • Using the school_id variable, randomize four groups at the school level, without stratifying. Set your seed to 12345.
  • Check baseline balance on written math test scores.
  • Estimate the MDE of the CCE program on oral Hindi test scores.

Note: For this exercise, use the baseline standard deviation (without adjusting for covariates) — do not calculate a residual standard deviation from a regression.

📝 What does balance look like for written math test scores? Would you expect it to be better, worse, or the same when randomizing at a lower level?
Variable       Coefficient   Std. Err.   P-Value
CCE             0.061        0.068       0.372
TARL           -0.033        0.074       0.656
CCE and TARL    0.020        0.075       0.795
Constant       -0.048        0.052       0.353
📝 What is the smallest effect on oral Hindi test scores due to the CCE program your study would be able to find?
The minimum detectable effect is an increase of 0.143 standard deviations in oral Hindi test scores, or an average of 0.134 standard deviations at endline.

D.2 Unequal allocation ratios

After considering the costs and logistics of program implementation, the researchers find out that providing the CCE and TARL programs to all eligible schools would be too expensive. To meet their budget, they decide to explore how using an unequal allocation ratio (with more schools in the comparison group) affects the study’s statistical power.

💡 How do you expect this unequal allocation to affect your baseline balance? With fewer schools in each treatment arm, chance imbalances in baseline characteristics become more likely in the smaller groups, so you may see larger (though still random) differences than under equal allocation.

Instructions: Complete the following exercises:

  • Using the school_id variable, randomize four groups at the school level, without stratifying. Set the comparison group to have 50% of the sample, and each treatment group to have 16.7%. Set your seed to 12345 and deal with misfits globally.
  • Check baseline balance on written math test scores.
  • Estimate the MDE of the CCE program on oral Hindi test scores.

Note: For this exercise, use the baseline standard deviation (without adjusting for covariates) — do not calculate a residual standard deviation from a regression.

Hints for Stata users


The randtreat command in Stata has an option for unequal allocation of treatment assignments, unequal(). The example below randomizes four groups, with the comparison group receiving 20% of the sample, the first treatment group receiving 40%, and the second and third treatment groups receiving 20% each.

randtreat, generate(treatment) unequal(1/5 6/15 3/15 3/15) setseed(85371)

Run help randtreat in your Stata console for additional documentation.
Hints for R users


📝 What does balance look like for written math test scores?
Variable       Coefficient   Std. Err.   P-Value
CCE            -0.060        0.075       0.427
TARL           -0.086        0.074       0.243
CCE and TARL   -0.166        0.061       0.007
Constant        0.017        0.037       0.649
📝 What is the smallest effect on oral Hindi test scores due to the CCE program your study would be able to find?
Surprise! Even after allocating half of the sample to the comparison group, the study would still be able to detect an increase in test scores of 0.143 standard deviations, or an average of 0.134 standard deviations at endline.

D.3 Including covariates in power calculations

Including covariates in power calculations can improve statistical power by accounting for variation in the outcome of interest that is unrelated to the treatment. By explaining some of this variation, covariates reduce the residual variance, allowing for more precise estimates of the treatment effect.

Adjusting for covariates can be especially useful when the covariates strongly predict the outcome (for example, a baseline measure of the outcome itself) or when the sample is small, so that reducing residual variance meaningfully improves precision.

Instructions: Using the same randomization as the previous section, estimate the MDE of the CCE program on oral Hindi test scores adjusting for the following covariates: gender, grade, age (in years), and baseline oral math test scores, written math test scores, and written Hindi test scores.

📝 What is the smallest effect on oral Hindi test scores due to the CCE program your study would be able to find?
The minimum detectable effect decreases from 0.143 without covariates to 0.083 when covariates are included.

D.4 Power calculations for binary variables

So far, you’ve worked with continuous outcomes, where variation is described by the standard deviation, and you’ve used this to inform your power calculations. In contrast, binary outcomes take on only two values (e.g., 0/1 for “did not achieve” vs. “achieved”). Instead of using the standard deviation, power calculations for binary outcomes rely on the proportion of individuals in the comparison group with the outcome of interest. If you were doing power calculations manually with a binary variable \(Y\), which takes values 0 or 1 and has mean \(P\), the variance of \(Y\) is given by \(\text{Var}(Y) = P(1 - P)\).
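For example, if half of the comparison group scores above the median (P = 0.5), the implied standard deviation is sqrt(0.5 × 0.5) = 0.5:

* SD implied by a binary outcome with mean P = 0.5
display sqrt(0.5 * (1 - 0.5))    // = .5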

Instructions: Using the same randomization as the previous section, create binary variables for each of the four learning outcomes based on their medians (i.e., equal to 1 if an observation’s value is greater than that outcome’s median). Then, estimate the MDE you’d be able to detect due to the CCE program for the proportion of students with oral Hindi test scores above the median. Don’t account for covariates.

Hints for Stata users


You can create the binary versions of the learning outcome variables using the loop code chunk below:

foreach var in base_aser_math_norm base_aser_read_norm base_m_score_norm base_h_score_norm {
    cap drop median_`var'
    cap drop `var'_b
    egen median_`var' = median(`var')
    gen `var'_b = (`var' > median_`var') if !missing(`var')
}

The Stata power command has a version for working with binary variables, power twoproportions. It works very similarly to the power twomeans command; however, Stata computes the variance of your outcome automatically. Run the line below for further documentation.

help power twoproportions
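As a hedged sketch, a call reusing the cluster parameters stored in section C.1 might look like the following. The binary outcome name follows the loop above; ideally, rho would also be re-estimated for the binary outcome (e.g., with loneway). Check help power twoproportions for the exact cluster options available in your Stata version.

* Comparison-group proportion of the binary oral Hindi outcome
sum base_aser_read_norm_b if treatment == 0
global p_control = r(mean)

* MDE for the difference in proportions with a clustered design
power twoproportions $p_control, k1($clusters_c) k2($clusters_t) ///
    m1($cluster_size_c) m2($cluster_size_t) rho($icc) alpha(.05) power(.8)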
Hints for R users
📝 What is the smallest effect your study would be able to find on the proportion of students with oral Hindi scores above the median due to the CCE program?
The minimum detectable effect for the CCE program is 0.063, which represents a 6.3 percentage point increase in the share of students with scores above the median, or a total proportion of 56.6% at endline.

D.5 Stratification and subgroup analysis

Stratification and subgroup analysis are important tools for improving study design and for answering specific research questions about how impacts might vary across different groups. Stratification refers to randomization done within subgroups (such as regions, blocks, or baseline characteristics) to help ensure balance between treatment and control groups on key variables. This can improve the precision of your estimates and reduce the likelihood of baseline imbalances, especially in smaller samples or clustered designs.

Subgroup analysis involves examining treatment effects separately for different groups (e.g., by grade level, gender, or baseline ability). This helps assess whether the program works differently across these groups, but it typically comes at the cost of lower statistical power because you are working with smaller samples.

In this section, you will explore how stratification and subgroup analysis affect power and precision, and how these choices interact with your study design.

Instructions: Using the continuous oral Hindi test scores again, complete the following exercises:

  • Randomize treatment assignment to 4 groups of equal proportions at the school campus level. Set your seed to 12345. Don’t stratify your randomization.
  • Conduct power calculations for the MDE of the CCE program on oral Hindi test scores (estimate this using the residual standard deviation after adjusting for covariates).
📝 What is the smallest effect your study would be able to find on oral Hindi scores due to the CCE program?
The minimum detectable effect for the CCE program is an improvement of 0.077 standard deviations in oral Hindi test scores, or an average of 0.076 standard deviations at endline.

Suppose you wanted to understand whether the impact of the CCE program differs by gender. Since randomization was done at the school campus level, and there are both girls and boys in each school, we can’t stratify our randomization by gender. How would you conduct your power calculations to ensure you have enough power to detect a difference in the program impact by gender?
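One approach, sketched below, is to re-estimate the section C.1 parameters on the girls-only subsample and plug the new values into power twomeans:

* Re-estimate cluster counts and average cluster sizes among girls
count if treatment == 0 & female == 1
global students_c_g = r(N)
count if treatment == 1 & female == 1
global students_t_g = r(N)

distinct super_school_id if treatment == 0 & female == 1
global clusters_c_g = r(ndistinct)
distinct super_school_id if treatment == 1 & female == 1
global clusters_t_g = r(ndistinct)

global cluster_size_c_g = floor($students_c_g / $clusters_c_g)
global cluster_size_t_g = floor($students_t_g / $clusters_t_g)

* Re-estimate the residual SD and ICC on the female == 1 subsample,
* then rerun power twomeans with these new globals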

Instructions: Estimate the MDE for the CCE program for your subgroup analysis of girls’ oral Hindi test scores.

📝 What is the smallest effect your study would be able to find on oral Hindi scores due to the CCE program? Would you still plan to run your study as-is based on this result?
The minimum detectable effect for the CCE program is an improvement of 0.10 standard deviations in oral Hindi test scores, or an average of 0.137 standard deviations at endline.

Suppose that you’d like to understand whether the CCE program is more or less effective for students in schools that score lower on average.

Instructions: Complete the following exercises:

  • Create a new variable q_read_campus from base_aser_read_norm that assigns each school campus to one of four quartiles based on its average baseline oral Hindi test score.
  • Randomize treatment at the school-campus level to 4 groups. Set your seed to 12345. Deal with misfits globally.
  • Estimate the MDE for a sub-group analysis for students in the lowest-scoring quartile of schools for the CCE program.
Hints for Stata users


You can set up your quartile variable for school campuses using the code block below.

use "$data/primary_cleaned.dta", clear
matrix drop _all

* Create quartiles of baseline average oral Hindi test scores at the school campus level
collapse (mean) base_aser_read_norm, by(super_school_id)

    * Create quartiles at the school campus level
    xtile q_read_campus = base_aser_read_norm, nq(4)

    * Label the quartiles for clarity 
    cap label define q_read_campus 1 "Q1: Lowest" 2 "Q2" 3 "Q3" 4 "Q4: Highest"
    label values q_read_campus q_read_campus

    * Save the campus-level quartile stratification variable
    save "$temp/campus_quartiles.dta", replace

    * Merge this back into the student-level dataset
    use "$data/primary_cleaned.dta", clear
    merge m:1 super_school_id using "$temp/campus_quartiles.dta"
    drop _merge

    * Check the distribution
    tab q_read_campus, m

    * Label the new variable
    cap label variable q_read_campus "Stratum based on baseline Oral Hindi (school campus avg)"
Hints for R users
📝 What is the smallest effect you’d be able to find for students in the lowest-scoring quartile of schools?
The MDE is 0.084 standard deviations.

Instructions: Now, complete the following exercises:

  • Randomize treatment at the school-campus level to 4 groups. Stratify your randomization using the q_read_campus variable you created in the previous section, and set your seed to 12345. Deal with misfits locally.
  • Estimate the MDE for a sub-group analysis for students in the lowest-scoring quartile of schools for the CCE program.
📝 What is the smallest effect you’d be able to detect for students in the lowest-scoring quartile? Based on your output, how do you think the stratification impacted your MDE?
The MDE for the lowest-scoring quartile is 0.078 standard deviations, slightly smaller than the 0.084 you found without stratification, since stratifying on baseline test scores improves precision.

D.6 Estimating sample size

D.7 Sensitivity analysis

E. Power Estimation through Simulation