Introduction

Social-emotional learning (SEL) programs have recently gained increasing attention, making it vital to evaluate their efficacy in promoting social-emotional and academic skills. Low et al. (2019) analyzed a two-year dataset from the Second Step program, comprising behavioral reports and observations on over 9,000 students from 61 schools in 7 districts.

The longitudinal data allowed the researchers to investigate which type of growth curve model best explains the developmental trajectories. The study found that on most measures, students' performance improved only during the school year, not over the summer. Thus, the "summer slide" phenomenon was evident, and a linear growth model was a poor fit.

The researchers then used these models to examine the efficacy of the Second Step program. Students in the experimental group outperformed those in the control group on emotional symptoms, social-emotional scores, skills for learning, and emotion management. However, the program did not appear to improve students' academic performance, academic engagement, or disruptive behavior, and most effect sizes were relatively small.

The current project aims to reproduce the main results from this paper: 1) find the optimal growth curve model for each measure; and 2) evaluate the program's efficacy on SEL and academic measures. If possible, I would also like to test model fit with other measures, though this is outside the scope of my current statistical knowledge.

Justification for choice of study

The linear growth curve model is widely used to estimate students' performance over time. If it does not accurately describe the pattern, as Low et al. (2019) suggest, further longitudinal research on student program outcomes should pursue other models.

Additionally, the paper used only one method to measure model fit. If more measures are available, the results will be more robust.

Anticipated challenges

First, the original data analysis was conducted in SAS, and switching to R could introduce discrepancies in the results.

Second, I have not tried growth curve models before, and multilevel analysis is still a little new to me, so I will need help from either the class or the authors.

Third, I have not encountered the method used to account for missing data in the paper, so I need to learn more about it and its implications.

Methods

Description of the steps required to reproduce the results

  1. First, I will review the datasets and the codebook, and remove irrelevant variables.

  2. Next, I will conduct a round of exploratory analysis, such as counting the number of students in each group and computing the average change in scores over time.

  3. The next step is to reproduce the main result: finding the best fit among eight growth curve models. I will first specify the parameters of each model, use R packages to estimate the models, and identify the best fit with \(\Delta AICc\), as the paper did. Potentially I could use other methods to compare model fit, to ensure the robustness of the result.

If time allows, I will also reproduce the main effect of the intervention, using the best-fit models from step 3.

Differences from original study

  1. The authors used SAS to run the analyses. Since I will use R, I expect some discrepancies in the results.

  2. The first author mentioned to me that they made nuanced decisions about missing data, imputation, and outliers that might not be explicit in the paper.

I expect that these two differences will make it hard to reproduce the exact values in the results. However, if the results are robust enough, I should be able to reach the same conclusions (i.e., which model is the best fit for each DV?), which is the goal of this project.

Measure of success

  1. Identify all relevant variables.

  2. Missing data should be adequately handled.

  3. The models should be constructed correctly, with all nested variables specified. (Coefficients should be similar to the paper's results.)

  4. The best fit model found by reproduction should be consistent with the paper’s results.

Pipeline progress

Below is a summary of the dataset.

Timeline

Year 1: Fall 2012, Spring 2013

Year 2: Fall 2013, Spring 2014

Identifier

Variables:

  • StudentID: 8-digit number
  • TeacherID: 6-digit number
  • SchoolID: 3-digit number
  • District: string names
  • Site: string, ASU or UW

Each identifier is nested within the next one listed (students within teachers, teachers within schools, schools within districts, districts within sites).

The condition (Delayed Start vs. Early Start) is manipulated at the school level.

Condition

Variable:

  • Condition: string, DELAYED START or EARLY START

Outcome

DESSA

36 items, teacher-reported rating scale of students' social-emotional competencies

Range: 0 - 4

Subscales:

  • skills for learning (1,6,7,9,12,14,16,20,23)
  • empathy (5,8,17,18,22,26,29,34,35)
  • emotion management (2,3,15,21,25,27,28,31,36)
  • problem solving (4,10,11,13,19,24,30,32,33)
  • social-emotional composite (aggregate)

Variables:

  • DESSA1 - DESSA36: individual item score

SDQ

25 items, teacher-reported scale of students' behavior

Range: 0 (not true) - 2 (certainly true)

Subscales:

  • peer problems (6, 11r, 14r, 19, 23)
  • hyperactivity (2, 10, 15, 21r, 25r)
  • conduct problems (5, 7r, 12, 18, 22)
  • prosocial (1, 4, 9, 17, 20)
  • emotional symptoms (3, 8, 13, 16, 24)

Variables:

  • SDQ1 - SDQ25: individual item score

Academic Engagement Time

Percentage of observation intervals with academic engagement, from direct observation

Range: 0 - 1

Variable:

  • AET

Disruptive Behavior

Percentage of observation intervals with disruptive behavior, from direct observation

Range: 0 - 1

Variable:

  • DB

M-CBM

Scores on the M-CBM math assessment

No item-level data is available.

Variable:

  • MATHPC: total problems correct in M-CBM
  • MATHPCT: percent problems correct in M-CBM
  • MATHDC: count of digits correct in M-CBM
  • MATHDCM: digits correct per minute

R-CBM

Oral reading fluency: words read correctly per minute

“Because reading instruction focused on connected text generally does not begin until first grade, we expected different growth patterns for R-CBM among students who began the study in kindergarten than those who began the study in Grades 1 or 2.”

Range: (Spring 2013) 0 - 229

Variable:

  • ORFPM: words reading correct per minute

Other Variables of Interest

  • StudentRaceS: string, student race ((1) Asian (2) Native Hawaiian or Pacific Islander (3) Black or African American (4) American Indian or Alaska Native (5) Caucasian or White (6) More than one Race (7) Hispanic)
  • Sex: binary 0 boy/1 girl, student’s gender
  • StudentGrade: student’s grade
  • Stud_FRLunchS: whether student is in the free lunch program

(Summary statistics for these variables are in exploratory_analysis.Rmd.)

Results

Data preparation

The data-cleaning code is in data_wrangling.Rmd under the analysis folder, and the cleaned dataset is data_average.csv under the data folder. Due to space limits, I will only describe the wrangling process here (a condensed sketch of the first three steps follows the list):

  1. Use the foreign package to convert the .sav file to a .csv file. They are stored as data.sav (only locally, due to its large size) and data.csv in the data folder. This is a wide dataset with one column per test item per time point.

  2. Select desired variables, and pivot the table longer.

  3. Remove an outlier that likely resulted from a data entry error, and reverse the scores for reverse-coded test items.

  4. Then, compute the aggregate score for each measure, and, for each model type, create a time variable with the coding method described in the paper.
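
Since the full code lives in data_wrangling.Rmd, here is a condensed, hypothetical sketch of steps 1-3; the wide-format column names and the selected variables are illustrative, while the reversed SDQ items and the 0-2 scale come from the codebook above:

library(foreign)
library(tidyverse)
library(here)

# Step 1: convert the SPSS file to csv (file names as described above)
data_raw <- read.spss(here("data", "data.sav"), to.data.frame = TRUE)
write_csv(data_raw, here("data", "data.csv"))

# Step 2: keep identifiers and outcomes, then pivot the per-wave item columns
# longer ("T1_SDQ1"-style names are illustrative, not the real column names)
data_long <- data_raw %>%
  select(StudentID, SchoolID, Condition, matches("^T[1-4]_")) %>%
  pivot_longer(matches("^T[1-4]_"),
               names_to = c("time", "item"), names_sep = "_",
               values_to = "score")

# Step 3: reverse-score the reverse-coded SDQ items (0-2 scale, so 2 - x)
reversed_sdq <- c("SDQ7", "SDQ11", "SDQ14", "SDQ21", "SDQ25")
data_long <- data_long %>%
  mutate(score = if_else(item %in% reversed_sdq, 2 - score, score))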

Below is the cleaned dataset:

data_average <- read_csv(here("data", "data_average.csv"))
head(data_average)
## # A tibble: 6 x 29
##   StudentID SchoolID Condition   Site  District StudentGender StudentRaceS
##       <dbl>    <dbl> <chr>       <chr> <chr>    <chr>         <chr>       
## 1 101101125      101 Early Start UW    Renton   Male          (1) Asian   
## 2 101101125      101 Early Start UW    Renton   Male          (1) Asian   
## 3 101101125      101 Early Start UW    Renton   Male          (1) Asian   
## 4 101101125      101 Early Start UW    Renton   Male          (1) Asian   
## 5 101101189      101 Early Start UW    Renton   Female        (7) Hispanic
## 6 101101189      101 Early Start UW    Renton   Female        (7) Hispanic
## # ... with 22 more variables: Stud_FRLunchS <dbl>, StudentGrade <chr>,
## #   time <dbl>, time_A <dbl>, time_B <dbl>, time_C <dbl>, time_D <dbl>,
## #   time_E <dbl>, orfpm <dbl>, mathdcm <dbl>, db <dbl>, aet <dbl>,
## #   sdq_emotion <dbl>, sdq_social <dbl>, sdq_conduct <dbl>, sdq_hyper <dbl>,
## #   sdq_peer <dbl>, dessa_composite <dbl>, dessa_solve <dbl>,
## #   dessa_emotion <dbl>, dessa_empathy <dbl>, dessa_skill <dbl>

Key analysis

The key analysis is to find the best-fitting linear mixed-effects model for each outcome variable out of eight possible options. As shown in Figure 1, Model A assumes the same growth from each time point to the next; B predicts growth only during the school year, with scores remaining flat from the end of spring to the start of fall (T2 to T3); C assumes growth during Year 1 (T1 to T2) that stops afterward; D assumes growth during the school year but a drop in scores over the summer (T2 to T3); finally, E adds a quadratic term to A.

Model Set 1 and Model Set 2 differ in how the control group's performance is estimated. Model Set 1 assumes that the control group's growth can be predicted linearly, while Model Set 2 uses the same piecewise slope for the control group as for the intervention group.

Figure 1. Model demonstration from the paper.
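
The time variables created in step 4 of the wrangling encode these shapes. The exact values below are my reading of Figure 1 rather than the paper's published coding, so treat them as assumptions (time stores the wave number 1-4; time_D is omitted because I am less sure how the summer drop is coded):

# Hypothetical time codings inferred from Figure 1
data_average <- data_average %>%
  mutate(time_A = time - 1,             # A: equal growth at every interval
         time_B = c(0, 1, 1, 2)[time],  # B: school-year growth, flat over summer
         time_C = c(0, 1, 1, 1)[time],  # C: growth during Year 1 only
         time_E = time - 1)             # E: linear term; quadratic added in model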

Since the authors did not make explicit in the paper which fixed and random effects they included, I tried several versions using the lme4 package (written below in lme4 syntax, with score standing in for each outcome):

  1. score ~ Condition * time + (time | SchoolID) + (1 | StudentID)

Fixed effects of condition, time, and their interaction; random slope of time for each school; random intercept for each student; students and schools are crossed (I think?).

  2. score ~ Condition * time + (time | SchoolID/StudentID)

Fixed effects of condition, time, and their interaction; random slopes of time at the school and student-within-school levels; students are nested within schools.

  3. score ~ Condition * time + (1 | SchoolID/StudentID)

Fixed effects of condition, time, and their interaction; random intercepts at the school and student-within-school levels; students are nested within schools.

When modeling E2, a quadratic Time term is added as a fixed effect.

The last formula produces results closest to the paper's and the fewest convergence issues, so I use it when fitting the models.

To handle missing data, I follow the paper's method and use maximum likelihood estimation. However, I cannot use auxiliary variables in the process, as the authors did not mention the exact auxiliary variables used to estimate each dependent variable.
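
Since run_lmer() is my own wrapper defined elsewhere in the project code, here is a minimal sketch of what it does, assuming the third formula above; fitting with REML = FALSE (maximum likelihood) keeps the \(AICc\) values comparable across models:

library(lme4)

# Build formula 3 for a given outcome and time coding, then fit by maximum
# likelihood (REML = FALSE); the E2 quadratic term is handled separately
run_lmer <- function(time_var, test, data) {
  f <- as.formula(paste0(test, " ~ Condition * ", time_var,
                         " + (1 | SchoolID/StudentID)"))
  lmer(f, data = data, REML = FALSE)
}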

After specifying the model-fitting function, I use a for loop to fit each model. In each iteration, I also use a function from AICcmodavg to calculate \(AICc\). Below is the code for estimating Model Set 2.

time_vars_2 <- c("time_B", "time_C", "time_D", "time_E")
tests <- colnames(data_average)[17:29]
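# Results container for every model x outcome combination; coef_list will hold
# each model's fixed-effect estimates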
mod_result <- as.data.frame(matrix(NA, 140, 5)) %>%
  rename(model_num = V1,
         model_set = V2,
         test_question = V3,
         aicc_value = V4,
         coef_list = V5)

n <- 0

for (time_var in time_vars_2) {

  for (test in tests) {
    n <- n + 1
    mod <- run_lmer(time_var, test, data_average)
    aicc <- AICcmodavg::AICc(mod, return.K = FALSE, second.ord = TRUE)

    mod_result$model_num[n] <- time_var
    mod_result$model_set[n] <- 2
    mod_result$test_question[n] <- test
    mod_result$aicc_value[n] <- aicc
    mod_result$coef_list[n] <- list(unname(fixef(mod)))
  }
}

As a side note, the authors modeled reading scores separately for students who began the study in kindergarten versus in Grades 1 or 2, as reading instruction on connected text is not introduced until first grade. Therefore, I split the data into these two groups and estimate their reading models separately.

Then, using the following code, I calculate \(\Delta AICc\) for each model by finding the minimum \(AICc\) among the eight models for each outcome and subtracting it from each model's \(AICc\).
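
(In formula form, for model \(i\) on a given outcome: \(\Delta AICc_i = AICc_i - \min_j AICc_j\), where \(j\) ranges over the eight candidate models.)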

mod_result <- mod_result %>%
  group_by(test_question) %>%
  mutate(min_value = min(aicc_value)) %>%
  ungroup() %>%
  mutate(diff = aicc_value - min_value) 

The model comparison is demonstrated in the following data frame, in which my \(\Delta AICc\) values are listed under the diff column, and the results from the original paper are in paper_diff.

mod_result_compare <- read_csv(here("data","mod_result_compare.csv"))
## 
## -- Column specification --------------------------------------------------------
## cols(
##   model_letter = col_character(),
##   model_set = col_double(),
##   test_question = col_character(),
##   aicc_value = col_double(),
##   min_value = col_double(),
##   diff = col_double(),
##   model_name = col_character(),
##   paper_diff = col_double()
## )
head(mod_result_compare)
## # A tibble: 6 x 8
##   model_letter model_set test_question aicc_value min_value  diff model_name    
##   <chr>            <dbl> <chr>              <dbl>     <dbl> <dbl> <chr>         
## 1 B                    2 mathdcm           80502.    80502.  0    mathdcm_B2    
## 2 B                    2 db               205032.   205029.  2.42 db_B2         
## 3 B                    2 aet              221982.   221960. 21.0  aet_B2        
## 4 B                    2 sdq_emotion       95313.    95310.  2.92 sdq_emotion_B2
## 5 B                    2 sdq_social       103794.   103714. 80.7  sdq_social_B2 
## 6 B                    2 sdq_conduct       86574.    86570.  3.66 sdq_conduct_B2
## # ... with 1 more variable: paper_diff <dbl>

To further compare my replication with the original study findings, I create a scatter plot for each dependent variable, where each dot represents a model. The x-axis is the \(\Delta AICc\) value of that model from my reproduction, and the y-axis is the value reported in the paper.

make_aicc_plot <- function(test) {
  data <- mod_result_compare %>%
    filter(test_question == test) %>%
    mutate(model_name_short = paste0(model_letter, model_set))
  ggplot(data, aes(x = diff, y = paper_diff, color = model_name_short,
                   label = model_name_short)) +
    geom_point() +
    coord_fixed(ratio = 1) +
    geom_abline(slope = 1) +
    labs(title = test,
         x = "replication AICc diff",
         y = "paper AICc diff") +
    geom_text(hjust = 0, vjust = 0) +
    theme_classic() +
    theme(legend.position = "none")
}
# par(mfrow) does not affect ggplot objects, so each plot prints on its own
for (i in unique(mod_result_compare$test_question)) {
  print(make_aicc_plot(i))
}

The 45-degree line shows where the points would lie if my replication findings were exactly the same as the paper's. If a point is at the origin, the paper and my replication agree on the best-fit model for that dependent variable. In sum, I am able to replicate 12 out of the 15 model selection decisions.

Some trends can be spotted in these scatter plots. First, academic engaged time (aet) and disruptive behavior (db) produce the most divergent \(\Delta AICc\) values, especially for model D2; both happen to be observational measures collected in the same settings. For the academic tests, math (mathdcm) and reading (orfpm_E and orfpm_K), models such as C2 and D2 deviate greatly from the 45-degree line. Additionally, some SDQ and DESSA measures, such as SDQ conduct problems and DESSA empathy, show systematically greater \(\Delta AICc\) values than the paper's.

Exploratory analyses

I continue by plotting the growth curve predictions implied by the coefficients from my reproduction and from the paper. Below is the set of plots for models that both my results and the paper's agree are the best fits.

plot_growth_df <- read_csv(here("data", "plot_growth_df.csv"))
colors <- c("my" = "black", "paper" = "red")
# "actual" = "green"
plot_growth <- function(mod_name) {
  plot <- ggplot(plot_growth_df %>% filter(model_name == mod_name),
                 aes(x = time_point)) +
    geom_line(aes(group = 1, y = my_early, color = "my", linetype = "early"), size = 0.8) +
    geom_line(aes(group = 1, y = my_delayed, color = "my", linetype = "delayed"), size = 0.8) +
    geom_line(aes(group = 1, y = paper_early, color = "paper", linetype = "early"), size = 0.8) +
    geom_line(aes(group = 1, y = paper_delayed, color = "paper", linetype = "delayed"), size = 0.8) +
    theme_bw() +
    labs(x = "Time Point",
         y = "Score",
         title = mod_name) +
    geom_point(aes(y = my_early), color = "black") +
    geom_point(aes(y = my_delayed), color = "black") +
    geom_point(aes(y = paper_early), color = "red") +
    geom_point(aes(y = paper_delayed), color = "red") +
    scale_color_manual(values = colors)
  print(plot)
}
correct_model_intercept <- plot_growth_df %>%
  filter(agreement == "both")
for (i in unique(correct_model_intercept$model_name)) {
  plot_growth(i)
}

These plots show that the replication results are very similar to the original study's: the black lines (replication) and the red lines (paper) lie close to each other.

Figure 2. Growth curve predictions for disruptive behavior and actual mean scores. (Code available in main_analysis.Rmd.)

However, when different models are chosen as the best, this alignment disappears. For example, the curves from the original study and my reproduction are drastically different for disruptive behavior (see Figure 2): the paper chooses B2, while my analysis chooses C2 as the best model. After adding the mean scores for both conditions at every time point to the plot, however, I notice that neither B2 nor C2 predicts disruptive behavior well. The mean scores for the intervention group (Early Start) show a somewhat D2-like pattern, implying that disruptive behavior in the classroom increased for children in the SEL curriculum after the summer but declined again by the end of the second year. On the other hand, disruptive behavior in the control group peaked at the end of the first year and declined as students grew older. None of the eight models proposed by the authors could capture both patterns at the same time.

Besides these main analyses, I have also run descriptive analyses on the dataset. First, I compute the mean, standard deviation, sample size, and missing-data rate for each test at each time point, separated by condition. A direct comparison with the original paper's statistics (in the supplemental material) shows that the sample size is slightly smaller in my cleaned dataset, implying that the authors removed a few children for various reasons.

data_average %>%
  pivot_longer(cols = orfpm:dessa_skill,
               names_to = "test_question",
               values_to = "score") %>%
  group_by(test_question, Condition, time) %>%
  summarize(n = sum(!is.na(score)),
            mean = mean(score, na.rm = TRUE),
            sd = sd(score, na.rm = TRUE),
            missing_rate = mean(is.na(score))) %>%
  kable(caption = "Table 1. Descriptive statistics for each test, separated by condition and time.") %>%
  collapse_rows(columns = 1, valign = "top")
## `summarise()` has grouped output by 'test_question', 'Condition'. You can override using the `.groups` argument.
Table 1. Descriptive statistics for each test, separated by condition and time.
test_question Condition time n mean sd missing_rate
aet Delayed Start 1 3268 83.341799 20.186057 0.2773109
Delayed Start 2 2999 80.087918 22.457127 0.3367979
Delayed Start 3 2983 80.327348 21.266099 0.3403361
Delayed Start 4 2973 79.986882 22.417906 0.3425475
Early Start 1 3322 81.903198 20.672618 0.2798613
Early Start 2 3079 79.739821 22.567112 0.3325385
Early Start 3 3167 81.438637 20.833767 0.3134620
Early Start 4 3010 81.154214 21.486995 0.3474962
db Delayed Start 1 3269 8.815753 14.370629 0.2770898
Delayed Start 2 2998 9.602048 16.502955 0.3370190
Delayed Start 3 2983 9.280352 15.517967 0.3403361
Delayed Start 4 2973 9.078260 15.819513 0.3425475
Early Start 1 3330 9.531577 15.500510 0.2781270
Early Start 2 3079 8.607429 14.647855 0.3325385
Early Start 3 3160 9.116886 15.402172 0.3149794
Early Start 4 3010 8.755512 15.304671 0.3474962
dessa_composite Delayed Start 1 3064 94.243799 26.136042 0.3224237
Delayed Start 2 2738 101.794010 26.069872 0.3945157
Delayed Start 3 2803 99.496611 25.153311 0.3801415
Delayed Start 4 2880 101.717361 25.730278 0.3631137
Early Start 1 3321 96.097260 25.570076 0.2800780
Early Start 2 2864 106.480098 25.097051 0.3791459
Early Start 3 2940 100.432993 24.907968 0.3626707
Early Start 4 2820 105.217021 25.748608 0.3886842
dessa_emotion Delayed Start 1 3085 23.474878 6.543759 0.3177797
Delayed Start 2 2742 25.247629 6.662086 0.3936311
Delayed Start 3 2806 24.527798 6.373836 0.3794781
Delayed Start 4 2882 25.268217 6.618730 0.3626714
Early Start 1 3324 24.115824 6.347790 0.2794277
Early Start 2 2871 26.604319 6.342559 0.3776284
Early Start 3 2960 25.012500 6.172154 0.3583351
Early Start 4 2834 26.420607 6.489140 0.3856493
dessa_empathy Delayed Start 1 3141 23.154091 7.302853 0.3053958
Delayed Start 2 2758 25.311095 6.992277 0.3900929
Delayed Start 3 2819 24.826179 6.779997 0.3766033
Delayed Start 4 2896 25.292818 6.940649 0.3595754
Early Start 1 3342 23.600239 6.995229 0.2755257
Early Start 2 2893 26.537850 6.761986 0.3728593
Early Start 3 2999 24.853951 6.796234 0.3498808
Early Start 4 2878 26.177206 6.847655 0.3761110
dessa_skill Delayed Start 1 3143 24.488705 7.331544 0.3049536
Delayed Start 2 2757 26.103373 7.313614 0.3903140
Delayed Start 3 2823 25.413390 7.194734 0.3757187
Delayed Start 4 2898 25.946860 7.211290 0.3591331
Early Start 1 3360 24.579167 7.159912 0.2716237
Early Start 2 2914 26.947495 6.886231 0.3683070
Early Start 3 3008 25.500332 7.187775 0.3479298
Early Start 4 2878 26.472203 7.240716 0.3761110
dessa_solve Delayed Start 1 3134 23.317167 6.927621 0.3069438
Delayed Start 2 2759 25.179413 6.910864 0.3898717
Delayed Start 3 2822 24.710135 6.661430 0.3759398
Delayed Start 4 2896 25.169544 6.816665 0.3595754
Early Start 1 3342 23.766906 6.723500 0.2755257
Early Start 2 2882 26.297710 6.607101 0.3752439
Early Start 3 2988 25.026774 6.585226 0.3522653
Early Start 4 2863 26.028292 6.868453 0.3793627
mathdcm Delayed Start 1 3275 1.554898 1.437160 0.2757629
Delayed Start 2 3077 3.173817 1.617941 0.3195489
Delayed Start 3 3045 3.335304 1.465916 0.3266254
Delayed Start 4 2921 4.658465 1.452153 0.3540469
Early Start 1 3361 1.486989 1.454646 0.2714069
Early Start 2 3145 3.123084 1.694985 0.3182311
Early Start 3 3161 3.299510 1.466015 0.3147626
Early Start 4 2983 4.631830 1.394130 0.3533492
orfpm Delayed Start 1 3265 24.303216 33.398587 0.2779743
Delayed Start 2 3029 48.397491 42.280435 0.3301636
Delayed Start 3 3035 60.635255 44.578511 0.3288368
Delayed Start 4 2934 83.632243 41.825960 0.3511720
Early Start 1 3326 22.722790 33.739874 0.2789941
Early Start 2 3110 48.604502 45.104566 0.3258183
Early Start 3 3145 57.881399 45.667438 0.3182311
Early Start 4 2971 82.048132 42.549136 0.3559506
sdq_conduct Delayed Start 1 3148 1.050508 1.739188 0.3038479
Delayed Start 2 2764 1.096961 1.806612 0.3887660
Delayed Start 3 2845 1.018278 1.720139 0.3708536
Delayed Start 4 2896 1.245856 1.966116 0.3595754
Early Start 1 3361 1.066052 1.675039 0.2714069
Early Start 2 2896 1.068370 1.723601 0.3722090
Early Start 3 3012 1.014940 1.653105 0.3470626
Early Start 4 2883 1.116545 1.818904 0.3750271
sdq_emotion Delayed Start 1 3148 1.152160 1.875377 0.3038479
Delayed Start 2 2765 1.321881 1.925487 0.3885449
Delayed Start 3 2849 1.374868 1.968827 0.3699690
Delayed Start 4 2901 1.574285 2.093277 0.3584697
Early Start 1 3352 1.249403 1.916295 0.2733579
Early Start 2 2906 1.213696 1.853897 0.3700412
Early Start 3 3017 1.293669 1.904884 0.3459788
Early Start 4 2884 1.362344 1.981562 0.3748103
sdq_hyper Delayed Start 1 3148 3.358958 3.121030 0.3038479
Delayed Start 2 2764 3.241317 3.125904 0.3887660
Delayed Start 3 2849 3.457353 3.133111 0.3699690
Delayed Start 4 2900 3.428965 3.168291 0.3586908
Early Start 1 3370 3.529080 3.096913 0.2694559
Early Start 2 2914 3.109128 3.002305 0.3683070
Early Start 3 3017 3.458071 3.124778 0.3459788
Early Start 4 2885 3.302253 3.133885 0.3745935
sdq_peer Delayed Start 1 3143 1.413618 1.675477 0.3049536
Delayed Start 2 2762 1.247647 1.643598 0.3892083
Delayed Start 3 2843 1.292649 1.653959 0.3712959
Delayed Start 4 2899 1.322180 1.710912 0.3589120
Early Start 1 3340 1.457784 1.657843 0.2759592
Early Start 2 2885 1.177816 1.569891 0.3745935
Early Start 3 3007 1.389425 1.611975 0.3481465
Early Start 4 2883 1.312522 1.637034 0.3750271
sdq_social Delayed Start 1 3121 7.012496 2.508735 0.3098187
Delayed Start 2 2762 7.657495 2.369979 0.3892083
Delayed Start 3 2814 7.582445 2.358274 0.3777090
Delayed Start 4 2893 7.690633 2.393955 0.3602388
Early Start 1 3330 7.144444 2.483682 0.2781270
Early Start 2 2880 7.739236 2.336057 0.3756774
Early Start 3 3003 7.432234 2.365993 0.3490137
Early Start 4 2865 7.669110 2.356223 0.3789291

I also replicated the correlation plot from the paper. (Note: the reading score orfpm is removed from the plot because the paper used data filtered by student grade, which is hard to do here.)
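
The plotting code lives in exploratory_analysis.Rmd; the underlying matrix can be computed along these lines (a minimal sketch, assuming the cleaned dataset above; pairwise-complete observations handle the missing scores):

# Correlations among the outcome measures at Time 1 (orfpm excluded, as noted)
cor_t1 <- data_average %>%
  filter(time == 1) %>%
  select(mathdcm:dessa_skill) %>%
  cor(use = "pairwise.complete.obs")
round(cor_t1, 2)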

Figure 3. Correlation plot at Time 1. (Code available in exploratory_analysis.Rmd.)

Figure 4. Correlation plot at Time 4. (Code available in exploratory_analysis.Rmd.)

Table 2. Correlation plot from the paper. Upper right: Time 1; bottom left: Time 4.

The correlation coefficients are not exactly the same, but they fall in similar ranges. DESSA measures have strong intercorrelations, while the correlations within SDQ are a little weaker. Classroom behavior measures (db, aet) and the math score (mathdcm) are only weakly correlated with the other variables most of the time.

Discussion

Summary of Reproduction Attempt

In sum, I am able to replicate 12 out of the 15 model selection decisions, so I partially reproduce the results from the paper. However, since the paper and I disagree on the best-fitting model for three outcome variables, and some \(\Delta AICc\) values vary drastically even when the selected best-fit model matches, I suspect there are systematic errors underneath my analysis that I have not been able to spot. Possible reasons for the variation include: 1) the fixed and random effects may not be specified the way the paper specifies them; 2) omitting the auxiliary variables in the FIML estimation could cause differences; 3) the outlier-removal decisions are unclear, so we might have been using slightly different datasets.

Commentary

Here are some of the things I noticed and contemplated while working on this project:

  1. Interestingly, the authors in fact collected data from five time points! They measured everything again in Year 3. I am interested in how the results could have changed if the models were estimated with five waves, and how many more possible models could be proposed.

  2. The above discussion connects to this concern: with so many assumptions built into the models, do they suffer from overfitting? If the dataset were extended to 10 years, would the score fluctuations that these models capture actually matter?

  3. One question that bugs me is the implication of this research. At this point, I can say with great confidence that a simple linear growth model assuming the same slope over time does not work well. However, I find it hard to justify why the developmental trajectory of a given measure should be estimated by, for instance, D2 but not C2, and I doubt whether these results would replicate with a different sample.

  4. Due to time constraints, I was not able to reproduce the main effects of the SEL curriculum. But looking at the paper's conclusions, with an effect size of 0.10, I am not sure the SEL intervention is truly effective. I wonder if it could have even longer-term benefits that cannot be captured over the course of two years.

  5. I am grateful that I have received a lot of help from the authors. I especially appreciate the long email from Keith that answered my questions with so many details.

  6. And finally, thank you, the PSYCH 251 teaching team!! I am probably one of the few people, if not the only one, without a lab or PI to rely on, and I could not have completed this project without all of your help. I want to thank Julie for the exhaustive feedback, Sarah for helping me with debugging, Effie for answering my questions during OH, and Mike for those extended office hours to help me with everything!!

This project is written in R, with the help of the following packages: tidyverse, knitr, lme4, here, kableExtra, readr, foreign, AICcmodavg.