Dear students,
This homework assignment has two parts, with tasks on randomization and power analysis. Complete the tasks and send me your .Rmd file and the .html file with the code and the graphs. Submit the files through the LMS and add your surname to the file names.
Please don’t delete the task descriptions, so that I can see which code corresponds to which task. Run your code before submitting it to make sure that it works without errors. Don’t forget to answer the questions in the tasks.
Each correctly completed task is worth 2 points, for 10 points in total.
The deadline is 23:59 on September 29. You may complete and submit the homework up to 3 calendar days after the deadline (until 23:59 on October 2); in that case your score will be multiplied by 0.7. I encourage you to complete the homework before the deadline.
Good luck!
## Warning: package 'pwr' was built under R version 4.2.3
## Warning: package 'randomizr' was built under R version 4.2.3
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
In this part we are working with the PISA2012_russ dataset. PISA is an international study that assesses the literacy of 15-year-old students in math, science, and reading. We will use some variables from the data collected in Russia.
To complete the tasks, please see the description of the randomizr functions here or here.
Read the data. Show the first 10 observations.
#Read the data (read.spss() comes from the foreign package).
library(foreign)
data1 <- read.spss("C:/PISA2012_russ.sav", to.data.frame=TRUE)
#Show the first 10 observations.
head(data1, 10)
The unit of observation is a 15-year-old student.
* Make strata using the Location and StudyProgram variables. You should make one variable with categories denoting strata for each unique combination of location and educational program.
* Do stratified randomization: randomly assign students to one control and one treatment condition, and save this assignment as a new variable in the dataset.
* Make a table that shows the number of students assigned to these conditions within each stratum. Make a table that shows the total number of students in the control and treatment groups in general.
set.seed(123) #fix our randomized numbers
strata <- paste(data1$Location, data1$StudyProgram, sep = "_") #make stratum names
data1$Assignment1 <- sample(c("Control", "Treatment"), size = nrow(data1), replace = TRUE) #randomly assign each student to a condition (simple random assignment; the strata are not used in this draw)
stratum_table <- table(strata, data1$Assignment1) #make Table 1
print(stratum_table) #show Table 1
##
## strata                        Control Treatment
## City_basic voc edu                 24        21
## City_high edu                      79        68
## City_sec voc edu                   41        32
## City_up sec general edu           679       703
## Large City_basic voc edu            8        14
## Large City_high edu                62        59
## Large City_sec voc edu             30        25
## Large City_up sec general edu     394       389
## Small Town_high edu                65        77
## Small Town_sec voc edu              1         2
## Small Town_up sec general edu     272       247
## Town_basic voc edu                  9         4
## Town_high edu                      80        84
## Town_sec voc edu                    4         2
## Town_up sec general edu           489       495
## Village_basic voc edu               4         5
## Village_high edu                   81        78
## Village_up sec general edu        280       324
total_table <- table(data1$Assignment1) #make Table 2
print(total_table)#show Table 2
##
## Control Treatment
## 2602 2629
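The `sample()` call above draws each student's condition without reference to the stratum, which is why the counts within strata are uneven. A blocked (stratified) assignment could be sketched with `block_ra()` from randomizr, as below. The data frame `toy`, its values, and the column names `Stratum` and `Assignment` are hypothetical stand-ins for `data1`, not the PISA data.

```r
library(randomizr)

set.seed(123)

# Toy data standing in for data1 (hypothetical values, not the PISA file):
# 3 locations x 2 study programs, 20 students in each combination
toy <- data.frame(
  Location     = rep(c("City", "Town", "Village"), each = 40),
  StudyProgram = rep(c("high edu", "up sec general edu"), times = 60)
)

# One stratum label per unique Location x StudyProgram combination
toy$Stratum <- paste(toy$Location, toy$StudyProgram, sep = "_")

# Blocked assignment: the randomization is carried out within each stratum
toy$Assignment <- block_ra(
  blocks     = toy$Stratum,
  conditions = c("Control", "Treatment")
)

stratum_counts <- table(toy$Stratum, toy$Assignment)  # counts within each stratum
total_counts   <- table(toy$Assignment)               # overall counts
print(stratum_counts)
print(total_counts)
```

With blocking, each stratum is split as evenly as its size allows, so the two arms are balanced on location and study program by design.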
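* Select students of the 9th grade only and make a new dataframe.
* Assume that school is a `cluster` and location is a `strata`. Do stratified cluster randomization: randomly assign students into three treatment arms (one control group and two treatments, X1 and X2). Save your assignment as a new variable in the dataset.
* Make a table that shows the number of students in the control and treatment groups in general.
* Show the probabilities of assignment to the treatment arms (use the `declare_ra()` function).
set.seed(123) #fix our randomized numbers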
data9 <- subset(data1, Grade == 9) #Select students of the 9th grade only - make a new dataframe.
#Assume that school is a `cluster` and location is a `strata`. Do stratified cluster randomization - randomly assign students into three treatment arms (one control group, and two treatments - X1 and X2). Save your assignment as a new variable in the dataset.
randomized_data9 <- data9 %>%
  group_by(SchoolID, Location) %>%
  mutate(Assignment2 = sample(c("Control", "X1", "X2"), size = n(), replace = TRUE)) %>%
  ungroup()
#Make a table that shows number of students in the control and treatment groups in general.
table(randomized_data9$Assignment2)
##
## Control X1 X2
## 1247 1284 1300
#Show the probabilities of assignment to the treatment arms (use `declare_ra()` function)
declare_ra(N=3831, m_each = c(1247, 1284,1300),
conditions = c("control", "X1", "X2"))
## Random assignment procedure: Complete random assignment
## Number of units: 3831
## Number of treatment arms: 3
## The possible treatment categories are control and X1 and X2.
## The number of possible random assignments is approximately infinite.
## The probabilities of assignment are constant across units:
## prob_control prob_X1 prob_X2
## 0.3255025 0.3351605 0.3393370
Answer: 3831 units in total; the probabilities of assignment are prob_control = 0.3255025, prob_X1 = 0.3351605, prob_X2 = 0.3393370.
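Note that `sample()` inside `mutate()` draws a condition for every student separately, so students from the same school can end up in different arms. If whole schools are to be assigned within locations, one possible sketch uses `block_and_cluster_ra()` and a matching `declare_ra()` call from randomizr. The data frame `toy9` and its values are hypothetical stand-ins for `data9`.

```r
library(randomizr)

set.seed(123)

# Toy data (hypothetical): 30 schools with 20 students each,
# 10 schools per location, so schools are nested within locations
toy9 <- data.frame(
  SchoolID = rep(1:30, each = 20),
  Location = rep(c("City", "Town", "Village"), each = 200)
)

arms <- c("Control", "X1", "X2")

# Stratified cluster assignment: schools are randomized within locations,
# and every student inherits the arm of his or her school
toy9$Assignment <- block_and_cluster_ra(
  blocks     = toy9$Location,
  clusters   = toy9$SchoolID,
  conditions = arms
)

# Check: every school should appear in exactly one arm
school_by_arm <- table(toy9$SchoolID, toy9$Assignment)
print(all(rowSums(school_by_arm > 0) == 1))

# Number of students per arm
print(table(toy9$Assignment))

# Declare the same design and inspect the assignment probabilities
declaration <- declare_ra(
  blocks     = toy9$Location,
  clusters   = toy9$SchoolID,
  conditions = arms
)
print(head(declaration$probabilities_matrix))
```

Declaring the design with the same `blocks` and `clusters` arguments keeps the reported probabilities tied to the procedure that was actually used, instead of to the realized group sizes.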
To complete the following tasks, please see Kabacoff (2014), Ch. 10.
Assume you are estimating the effect of a summer school on students’ outcomes. The literature suggests that the effect size ranges between 0.1 and 0.3 SD. Calculate the sample sizes you need to detect the minimum effect within this range with a t-test if your RCT groups are balanced (use a step of 0.01). Following the example described in Kabacoff (2014), make a graph that shows the sample sizes needed to detect the minimum effect size in this range.
#make a vector with effect sizes
effect_sizes <- seq(0.1, 0.3, by = 0.01)
#create a dataframe for results
sample_size_dataframe <- data.frame(EffectSize = numeric(0), SampleSize = numeric(0))
#calculate sample size and save the results in dataframe we've created
for (effect_size in effect_sizes) {
  sample_size <- pwr.t.test(d = effect_size, sig.level = 0.05, power = 0.8)$n
  sample_size_dataframe <- rbind(sample_size_dataframe, data.frame(EffectSize = effect_size, SampleSize = sample_size))
}
#make a graph that shows sample sizes (per group) needed to detect minimum effect size in this range
ggplot(sample_size_dataframe, aes(x = EffectSize, y = SampleSize)) +
  geom_line() +
  labs(x = "Effect Size (SD)", y = "Sample Size") +
  ggtitle("Sample size to detect minimum effect size") +
  theme_minimal()
How many students will you need for this experiment if you expect to detect minimum effect of 0.15 SD?
#calculate the required sample size
sample_size15 <- pwr.t.test(d = 0.15, sig.level = 0.05, power = 0.8)$n
#print the required sample size
print(round(sample_size15))
## [1] 699
Answer: We need 699 students in each group, 1398 students in total.
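As a rough cross-check of this figure, the per-group sample size for a balanced two-sample comparison can be approximated with the normal-theory formula n ≈ 2(z_{1-α/2} + z_{power})² / d². The sketch below compares that approximation with `pwr.t.test()`; the variable names are illustrative only.

```r
library(pwr)

d     <- 0.15  # minimum detectable effect in SD units
alpha <- 0.05  # two-sided significance level
power <- 0.80  # target power

# Normal approximation for the per-group sample size in a balanced design
n_approx <- 2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / d^2

# Value from pwr.t.test(), which works with the noncentral t distribution
n_exact <- pwr.t.test(d = d, sig.level = alpha, power = power)$n

print(c(approximation = n_approx, pwr = n_exact))

# Round up, since a fraction of a student cannot be recruited
n_per_group <- ceiling(n_exact)
print(c(per_group = n_per_group, total = 2 * n_per_group))
```

The two values should differ by roughly one unit per group. Rounding up with `ceiling()` rather than `round()` guards against ending up slightly below the target power.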
Assume you are estimating the effect of a math tutor on 8th grade
students’ test scores. You are going to use OLS with a number of
covariates. Among them are the pre-test score, father’s education
(higher/not higher), and student sex (boy/girl). In total, all these
covariates explain 20% of the variance in the post-test scores. You expect
that, on average, students who attended a tutor will score 0.1-0.2 SD higher
on the post-test. Make a graph to show what sample sizes are
needed to detect the minimum effect in that range (use a step of 0.001).
Recall how to calculate the sample size given v.
Note: To calculate \(sR^2\) given the OLS coefficient, see Cohen, P., West, S. G., & Aiken, L. S. (2014). Applied multiple regression/correlation analysis for the behavioral sciences. Psychology Press, p. 83. You may assume that randomization makes the correlation between your covariates and the assignment of students to treatment arms equal to zero.
#assign the variables
effect_sizes2 <- seq(0.1, 0.2, by = 0.001)
sample_sizes <- numeric(length(effect_sizes2))
#calculate the required sample size for each effect size
for (i in seq_along(effect_sizes2)) {
  effect_size <- effect_sizes2[i]
  #required sample size per group at 80% power (two-sample t-test)
  n <- pwr.t.test(
    d = effect_size,
    sig.level = 0.05,
    power = 0.8,
    type = "two.sample",
    alternative = "two.sided"
  )$n
  sample_sizes[i] <- n
}
#create a dataframe for the results
results_df <- data.frame(effect_size = effect_sizes2, sample_size = sample_sizes)
#create a plot
library(ggplot2)
ggplot(results_df, aes(x = effect_size, y = sample_size)) +
geom_line() +
labs(x = "Effect Size", y = "Sample Size") +
theme_minimal()
How many students do you need to select for this experiment if you expect that students who attended a tutor will, on average, score 0.12 SD higher on the post-test?
sample_size12 <- pwr.t.test(
d = 0.12,
sig.level = 0.05,
power = 0.8,
type = "two.sample",
alternative = "two.sided"
)$n
print(round(sample_size12))
## [1] 1091
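Answer: 1091 students in each group (2182 in total), based on a two-sample t-test that does not take the covariates into account.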
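The task describes an OLS model with covariates, so a calculation based on `pwr.f2.test()` may be closer to what is asked. The sketch below is one possible reading, under stated assumptions: the treatment is a balanced 0/1 indicator uncorrelated with the covariates, the effect d is expressed in SD units of the post-test score (so the standardized coefficient is d × 0.5 and \(sR^2 = (d/2)^2\)), and the model has exactly three covariates plus the treatment. The variable names are illustrative.

```r
library(pwr)

r2_covariates <- 0.20  # variance explained by the covariates (from the task)
n_covariates  <- 3     # assumption: pre-test, father's education, sex only
d_values      <- seq(0.1, 0.2, by = 0.001)

total_n <- sapply(d_values, function(d) {
  # Assumption: balanced binary treatment, so sd(T) = 0.5 and sr^2 = (d / 2)^2
  sr2 <- (d / 2)^2
  # Cohen's f2 for one added predictor: sr^2 / (1 - R^2 of the full model)
  f2 <- sr2 / (1 - (r2_covariates + sr2))
  v  <- pwr.f2.test(u = 1, f2 = f2, sig.level = 0.05, power = 0.8)$v
  # v = N - (number of predictors) - 1, where predictors = covariates + treatment
  ceiling(v) + (n_covariates + 1) + 1
})

# Total sample size as a function of the expected effect
plot(d_values, total_n, type = "l",
     xlab = "Effect size (SD)", ylab = "Total sample size")

# Total sample size for an expected effect of 0.12 SD
n_at_012 <- total_n[which.min(abs(d_values - 0.12))]
print(n_at_012)
```

Here `total_n` is the total number of students, not the number per group, because `v` is tied to the residual degrees of freedom of the whole regression.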
In this article, the authors attempt to estimate the effect of homework with Yandex.Education on 3rd grade students’ growth mindsets. On p. 103 you will find a table with OLS estimates. Estimate the power based on column 5 of this table. You can find the proportion of the outcome variance explained by homework with Yandex.Education in column 1. For the power analysis, see the description of the variables in the paper.
#Your code here
Answer:
How many students are needed to detect a statistically significant effect of the reported size?
#Your code here
Answer:
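A possible shape for the two calculations above is sketched here with `pwr.f2.test()`. Every number in the block is a hypothetical placeholder and is not taken from the paper; the sample size, the number of predictors, and the two R² values must be replaced with the values reported on p. 103.

```r
library(pwr)

# Hypothetical placeholders: replace with the values reported in the paper
n_students   <- 1000   # sample size of the column 5 model (placeholder)
n_predictors <- 6      # predictors in column 5, treatment included (placeholder)
r2_full      <- 0.30   # R^2 of the column 5 model (placeholder)
sr2_treat    <- 0.005  # variance explained by the treatment, column 1 (placeholder)

# Cohen's f2 for the treatment, given the full model
f2 <- sr2_treat / (1 - r2_full)

# Achieved power with the available sample: v = N - predictors - 1
achieved <- pwr.f2.test(u = 1, v = n_students - n_predictors - 1,
                        f2 = f2, sig.level = 0.05)
print(achieved$power)

# Sample size needed to reach 80% power for the same effect
needed   <- pwr.f2.test(u = 1, f2 = f2, sig.level = 0.05, power = 0.8)
n_needed <- ceiling(needed$v) + n_predictors + 1
print(n_needed)
```

Leaving `power` unspecified makes `pwr.f2.test()` solve for power, while leaving `v` unspecified makes it solve for the denominator degrees of freedom, from which the total sample size follows.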