HW 1. Randomization & Power Analysis

Dear students,

This homework assignment has two parts, with tasks on randomization and power analysis. Complete the tasks and send me your .Rmd and .html files with the code and the graphs. Submit the files through the LMS. Add your surname to the file names.

Please don’t delete the task descriptions, so that I can see which code corresponds to which task. Run your code before sending it and make sure that it works without errors. Don’t forget to answer the questions in the tasks.

Each correctly completed task is worth 2 points, 10 points in total.

The deadline is 23:59 on September 29. You may complete and submit the homework up to 3 calendar days after the deadline (until 23:59 on October 2), in which case your score will be multiplied by 0.7. I encourage you to complete the homework before the deadline.

Good luck!

Install the libraries

## Warning: package 'pwr' was built under R version 4.2.3

## Warning: package 'randomizr' was built under R version 4.2.3

## Warning: package 'tidyverse' was built under R version 4.2.2

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2

## Warning: package 'ggplot2' was built under R version 4.2.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Part A. Randomization

In this part we work with the PISA2012_russ dataset. PISA is an international study that assesses 15-year-old students’ literacy in math, science and reading. We will use some variables from the data collected in Russia.

To complete the tasks, please see the description of the randomizr functions here or here.

Task 1. Stratified randomization (0-2 points)

Read the data. Show the first 10 observations.

#Read the data (read.spss() comes from the foreign package).
library(foreign)
data1 <- read.spss("C:/PISA2012_russ.sav", to.data.frame=TRUE)

#Show the first 10 observations. 
head(data1, 10)

The unit of observation is a 15-year-old student.

* Make strata using the Location and StudyProgram variables. Create one variable whose categories denote the strata, one for each unique combination of location and educational program.
* Do stratified randomization: randomly assign students to one control and one treatment condition, and save this assignment as a new variable in the dataset.
* Make a table that shows the number of students assigned to these conditions within each stratum. Make a table that shows the total number of students in the control and treatment groups overall.

set.seed(123) #fix the random seed so that the assignment is reproducible

strata <- paste(data1$Location, data1$StudyProgram, sep = "_") #make stratum names

#randomly assign each student to a condition
#note: sample() does not use `strata`, so this is simple rather than stratified (blocked) assignment
data1$Assignment1 <- sample(c("Control", "Treatment"), size = nrow(data1), replace = TRUE)

stratum_table <- table(strata, data1$Assignment1) #make Table 1
print(stratum_table) #show Table 1

##                                
## strata                          Control Treatment
##   City_basic voc edu                 24        21
##   City_high edu                      79        68
##   City_sec voc edu                   41        32
##   City_up sec general edu           679       703
##   Large City_basic voc edu            8        14
##   Large City_high edu                62        59
##   Large City_sec voc edu             30        25
##   Large City_up sec general edu     394       389
##   Small Town_high edu                65        77
##   Small Town_sec voc edu              1         2
##   Small Town_up sec general edu     272       247
##   Town_basic voc edu                  9         4
##   Town_high edu                      80        84
##   Town_sec voc edu                    4         2
##   Town_up sec general edu           489       495
##   Village_basic voc edu               4         5
##   Village_high edu                   81        78
##   Village_up sec general edu        280       324

total_table <- table(data1$Assignment1) #make Table 2
print(total_table)#show Table 2

## 
##   Control Treatment 
##      2602      2629
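For comparison, here is a minimal sketch of how blocked (stratified) assignment could be done with block_ra() from randomizr. The data frame `toy`, its size and its category labels are made up for illustration and are not the PISA data; only the column names Location and StudyProgram follow the task text.

```r
library(randomizr)

set.seed(123)

# Hypothetical toy data standing in for data1 (not the PISA file)
toy <- data.frame(
  Location     = sample(c("City", "Town", "Village"), 200, replace = TRUE),
  StudyProgram = sample(c("high edu", "sec voc edu"), 200, replace = TRUE)
)

# One stratum label per unique Location x StudyProgram combination
toy$strata <- paste(toy$Location, toy$StudyProgram, sep = "_")

# block_ra() randomizes separately inside every stratum
toy$Assignment1 <- block_ra(
  blocks     = toy$strata,
  conditions = c("Control", "Treatment")
)

stratum_table <- table(toy$strata, toy$Assignment1)  # counts within each stratum
total_table   <- table(toy$Assignment1)              # overall counts
print(stratum_table)
print(total_table)
```

With two arms and the default equal probabilities, the Control and Treatment counts inside each stratum should differ by at most one, which is what distinguishes blocked assignment from an unstratified sample() call.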

Task 2. Stratified clustered randomization (0-2 points)

Select students of the 9th grade only and make a new dataframe. How many students have you selected? 3831
Assume that school is a cluster and location is a stratum. Do stratified cluster randomization: randomly assign students to three treatment arms (one control group and two treatments, X1 and X2). Save your assignment as a new variable in the dataset.
Make a table that shows the number of students in the control and treatment groups overall.
Show the probabilities of assignment to the treatment arms (use the declare_ra() function).

set.seed(123) #fix the random seed so that the assignment is reproducible

data9 <- subset(data1, Grade == 9) #select students of the 9th grade only and make a new dataframe

#Assign students to three treatment arms (one control group and two treatments, X1 and X2) and save the assignment as a new variable.
#Note: sample() inside group_by() draws a condition for every student separately, so students of one school can end up in different arms; the assignment here is not made at the cluster (school) level.
randomized_data9 <- data9 %>%
  group_by(SchoolID, Location) %>%
  mutate(Assignment2 = sample(c("Control", "X1", "X2"), size = n(), replace = TRUE)) %>%
  ungroup()

#Make a table that shows number of students in the control and treatment groups in general.
table(randomized_data9$Assignment2)

## 
## Control      X1      X2 
##    1247    1284    1300

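#Show the probabilities of assignment to the treatment arms (use `declare_ra()` function)
#Note: this call declares complete random assignment with the realized group sizes; it does not pass the strata (blocks) or the clusters.
declare_ra(N = 3831, m_each = c(1247, 1284, 1300),
           conditions = c("control", "X1", "X2"))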

## Random assignment procedure: Complete random assignment 
## Number of units: 3831 
## Number of treatment arms: 3 
## The possible treatment categories are control and X1 and X2.
## The number of possible random assignments is approximately infinite. 
## The probabilities of assignment are constant across units: 
## prob_control      prob_X1      prob_X2 
##    0.3255025    0.3351605    0.3393370

Answer: 3831

prob_control = 0.3255025, prob_X1 = 0.3351605, prob_X2 = 0.3393370
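A minimal sketch of how the stratified cluster design described in the task could be declared with randomizr, passing the strata as `blocks` and the schools as `clusters`. The toy data (30 schools of 20 students, 10 schools per location) are invented for illustration; the column names SchoolID and Location follow the code above.

```r
library(randomizr)

set.seed(123)

# Hypothetical toy data: 30 schools, 10 per location, 20 students each
schools <- data.frame(
  SchoolID = 1:30,
  Location = rep(c("City", "Town", "Village"), each = 10)
)
toy9 <- schools[rep(seq_len(nrow(schools)), each = 20), ]

# Declare the design: blocks are the strata (Location), clusters are schools
declaration <- declare_ra(
  blocks     = toy9$Location,
  clusters   = toy9$SchoolID,
  conditions = c("Control", "X1", "X2")
)
declaration  # printing summarizes the declared procedure

# Draw one assignment; all students of a school share the school's arm
toy9$Assignment2 <- conduct_ra(declaration)
table(toy9$Assignment2)

# Unit-level probabilities of assignment to each arm
head(declaration$probabilities_matrix)
```

Declaring the design first and then calling conduct_ra() keeps the reported probabilities tied to the procedure that actually generated the assignment.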

Part B. Power analysis

To complete the following tasks, please see Kabacoff (2014), Ch. 10.

Task 3. Sample size for t-test (0-2 points)

Assume you are estimating the effect of a summer school on students’ outcomes. The literature says that the effect size varies between 0.1 and 0.3 SD. Calculate the sample sizes you need to detect the minimum effect within this range with a t-test if your RCT groups are balanced (use a step of 0.01). Following the example described in Kabacoff (2014), make a graph that shows the sample sizes needed to detect the minimum effect size in this range.

#make a vector with effect sizes
effect_sizes <- seq(0.1, 0.3, by = 0.01)

#calculate the per-group sample size for each effect size
#(pwr.t.test() defaults to a two-sample, two-sided test)
sample_sizes3 <- sapply(effect_sizes, function(d) {
  pwr.t.test(d = d, sig.level = 0.05, power = 0.8)$n
})

#save the results in a dataframe
sample_size_dataframe <- data.frame(EffectSize = effect_sizes, SampleSize = sample_sizes3)

#make a graph that shows sample sizes needed to detect minimum effect size in this range
ggplot(sample_size_dataframe, aes(x = EffectSize, y = SampleSize)) +
  geom_line() +
  labs(x = "Effect Size (SD)", y = "Sample Size (per group)") +
  ggtitle("Sample size to detect minimum effect size") +
  theme_minimal()

How many students will you need for this experiment if you expect to detect a minimum effect of 0.15 SD?

#calculate the required sample size
sample_size15 <- pwr.t.test(d = 0.15, sig.level = 0.05, power = 0.8)$n

#print the required sample size
print(round(sample_size15))

## [1] 699

Answer: We need 699 students in each group, 1398 in total.
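As a rough cross-check of this number, the textbook normal approximation for a balanced two-sample comparison, n per group ≈ 2(z_{1-α/2} + z_{power})² / d², can be evaluated in base R. This is an approximation and not what pwr.t.test() computes (pwr solves for n with the noncentral t distribution), so it is expected to land slightly below the pwr result.

```r
d     <- 0.15   # minimum detectable effect in SD units
alpha <- 0.05   # two-sided significance level
power <- 0.80   # target power

# Normal-approximation sample size per group for a balanced two-sample comparison
n_approx <- 2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / d^2
ceiling(n_approx)  # about 698 per group, close to the 699 from pwr.t.test()
```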

Task 4. Sample size for OLS with covariates (0-2 points)

Assume you are estimating the effect of a math tutor on 8th-grade students’ test scores. You are going to use OLS with a number of covariates, among them the pre-test score, father’s education (higher/not higher), and student sex (boy/girl). In total, these covariates explain 20% of the variance in the post-test scores. You expect that, on average, students who attended a tutor will score 0.1-0.2 SD higher on the post-test. Make a graph to show what sample sizes are needed to detect the minimum effect in that range (use a step of 0.001). Recall how to calculate the sample size given v.

Note: To calculate \(sR^2\) given the OLS coefficient, see Cohen, P., West, S. G., & Aiken, L. S. (2014). Applied multiple regression/correlation analysis for the behavioral sciences. Psychology Press, p. 83. You may assume that randomization makes the correlation between your covariates and the assignment of students to treatment arms equal to zero.

#assign the variables
effect_sizes2 <- seq(0.1, 0.2, by = 0.001)
sample_sizes <- numeric(length(effect_sizes2))

#calculate the required sample size for each effect size
#note: pwr.t.test() does not account for the covariates described in the task
for (i in seq_along(effect_sizes2)) {
  effect_size <- effect_sizes2[i]

  #required sample size per group at 80% power
  n <- pwr.t.test(
    d = effect_size,
    sig.level = 0.05,
    power = 0.8,
    type = "two.sample",
    alternative = "two.sided"
  )$n

  sample_sizes[i] <- n
}

#create a dataframe for the results
results_df <- data.frame(effect_size = effect_sizes2, sample_size = sample_sizes)

#create a plot
library(ggplot2)
ggplot(results_df, aes(x = effect_size, y = sample_size)) +
  geom_line() +
  labs(x = "Effect Size", y = "Sample Size") +
  theme_minimal()

How many students do you need to select for this experiment if you expect that students who attended a tutor will, on average, score 0.12 SD higher on the post-test?

sample_size12 <- pwr.t.test(
  d = 0.12,  
  sig.level = 0.05, 
  power = 0.8,    
  type = "two.sample",
  alternative = "two.sided"
)$n

print(round(sample_size12))

## [1] 1091

Answer: 1091 students in each group, 2182 in total.
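The task description points to a regression-based calculation with covariates, so a minimal sketch of that route with pwr.f2.test() may be useful for comparison. It rests on assumptions that are not stated in the text: balanced arms, so that an effect of d SD corresponds to a squared semipartial correlation of roughly (d/2)² for the treatment dummy; the effect expressed in SDs of the post-test score; and the total sample size recovered as N = v + u + w + 1, with u = 1 tested predictor and w = 3 covariates. The helper name n_for_effect is hypothetical.

```r
library(pwr)

r2_cov <- 0.20  # share of variance explained by the covariates (from the task)
w      <- 3     # covariates: pre-test, father's education, sex
u      <- 1     # tested predictor: the treatment dummy

# Hypothetical helper: total N needed to detect an effect of d SD
n_for_effect <- function(d, power = 0.8, sig.level = 0.05) {
  sr2 <- (d / 2)^2                 # assumed semipartial R^2 of treatment (balanced arms)
  f2  <- sr2 / (1 - r2_cov - sr2)  # Cohen's f^2 for treatment over and above the covariates
  v   <- pwr.f2.test(u = u, f2 = f2, sig.level = sig.level, power = power)$v
  ceiling(v) + u + w + 1           # convert denominator df to total sample size
}

effect_sizes2 <- seq(0.1, 0.2, by = 0.001)
results_f2 <- data.frame(
  effect_size = effect_sizes2,
  sample_size = sapply(effect_sizes2, n_for_effect)
)

# Base-graphics version of the requested plot
plot(results_f2$effect_size, results_f2$sample_size, type = "l",
     xlab = "Effect size (SD)", ylab = "Total sample size")

n_for_effect(0.12)  # total N for an effect of 0.12 SD
```

Because the covariates absorb part of the outcome variance, the total from this sketch should come out below the 2 × 1091 students implied by the t-test calculation.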

Task 5. Sample size for OLS with covariates (0-2 points)

In this article, the authors attempted to estimate the effect of homework with Yandex.Education on 3rd-grade students’ growth mindsets. On p. 103 you will find a table with OLS estimates. Based on column 5 of this table, estimate the power. You may find the proportion of the outcome variance explained by the homework with Yandex.Education in column 1. For the power analysis, see the description of the variables in the paper.

#Your code here

Answer:

How many students are needed to detect a statistically significant effect of the reported size?

#Your code here

Answer: