Introduction

Social-emotional learning (SEL) programs have recently gained increasing attention, making it vital to evaluate their efficacy in promoting social-emotional and academic skills. Low et al. (2019) analyzed a two-year dataset from the Second Step program, comprising behavioral reports and observations on over 9,000 students from 61 schools in 7 districts.

The longitudinal data allowed the researchers to investigate which type of growth curve model best explains the developmental trajectories. The study found that on most measures, students' performance improved only during the school year, not over the summer. Thus, the "summer slide" phenomenon was evident, and a linear growth model was a poor fit.

The researchers then used these models to examine the efficacy of the Second Step program. Students in the experimental group outperformed those in the control group on emotional symptoms, social-emotional scores, skills for learning, and emotion management. However, the program did not appear to improve students' academic performance, academic engagement, or disruptive behavior, and most effect sizes were relatively small.

The current project aims to reproduce the main results from this paper: 1) find the optimal growth curve model for each measure; and 2) evaluate the program's efficacy on SEL and academic measures. If possible, I would also like to test model fit with other measures, though this is outside the scope of my current statistical knowledge.

Justification for choice of study

The linear growth curve model is widely used to estimate students' performance over time. If it does not accurately describe the pattern, as Low et al. (2019) suggest, further longitudinal research on student program outcomes should pursue other models.

Additionally, the paper used only one method to measure model fit. If more measures are available, the results will be more robust.

Anticipated challenges

First, the original data analysis was conducted in SAS, and switching to R could introduce discrepancies in the results.

Second, I have not tried growth curve models before, and multilevel analysis is still a little new to me, so I will need help from either the class or the authors.

Third, I have not encountered the method used to account for missing data in the paper, so I need to learn more about it and its implications.

Methods

Description of the steps required to reproduce the results

  1. First, I will review the datasets and the codebook, and remove irrelevant variables.

  2. Next, I will conduct a round of exploratory analysis, such as counting the number of students in each group and computing the average change in scores over time.

  3. The next step is to reproduce the main result: finding the best fit among eight growth curve models. I will first specify the parameters of each model, use R packages to estimate the models, and identify the best fit with \(\Delta AICc\), as the paper did. Potentially I could use other methods to compare model fit, to ensure the robustness of the result.

If time allows, I will also reproduce the main effect of the intervention, using the best-fit models from step 3.

Differences from original study

  1. The authors used SAS to run the analyses. Since I will use R, I expect some discrepancies in the results.

  2. The first author mentioned to me that they made nuanced decisions about missing data, imputation, and outliers that might not be explicit in the paper.

I expect that these two differences will make it hard to reproduce the exact values in the results. However, if the results are robust enough, I should be able to reach the same conclusions (i.e., which model is the best fit for each DV?), which is the goal of this project.

Measure of success

  1. Identify all relevant variables.

  2. Missing data should be adequately handled.

  3. The models should be constructed correctly, with all nested variables specified. (Coefficients should be similar to the paper's results.)

  4. The best fit model found by reproduction should be consistent with the paper’s results.

Pipeline progress

Below is a summary of the dataset.

Timeline

Year 1: Fall 2012, Spring 2013

Year 2: Fall 2013, Spring 2014

Identifier

Variables:

  • StudentID: 8-digit number
  • TeacherID: 6-digit number
  • SchoolID: 3-digit number
  • District: string names
  • Site: string, ASU or UW

Each identifier is nested within the next one listed (students within teachers, teachers within schools, schools within districts, districts within sites).

The condition (Delayed Start vs. Early Start) is manipulated at the school level.

Condition

Variable:

  • Condition: string, DELAYED START or EARLY START

Outcome

DESSA

36 items, teacher-reported rating scale of students' social-emotional competencies

Range: 0 - 4

Subscales:

  • skills for learning (1,6,7,9,12,14,16,20,23)
  • empathy (5,8,17,18,22,26,29,34,35)
  • emotion management (2,3,15,21,25,27,28,31,36)
  • problem solving (4,10,11,13,19,24,30,32,33)
  • social-emotional composite (aggregate)

Variables:

  • DESSA1 - DESSA36: individual item score

SDQ

25 items, teacher-reported scale of students' behavior

Range: 0 (not true) - 2 (certainly true)

Subscales:

  • peer problems (6, 11r, 14r, 19, 23)
  • hyperactivity (2, 10, 15, 21r, 25r)
  • conduct problems (5, 7r, 12, 18, 22)
  • prosocial (1, 4, 9, 17, 20)
  • emotional symptoms (3, 8, 13, 16, 24)

Variables:

  • SDQ1 - SDQ25: individual item score

Academic Engagement Time

Percentage of observation intervals with academic engagement, from direct observation

Range: 0 - 1

Variable:

  • AET

Disruptive Behavior

Percentage of observation intervals with disruptive behavior, from direct observation

Range: 0 - 1

Variable:

  • DB

M-CBM

Scores on the M-CBM math assessment

No item-level data is available.

Variable:

  • MATHPC: total problems correct in M-CBM
  • MATHPCT: percent problems correct in M-CBM
  • MATHDC: count of digits correct in M-CBM
  • MATHDCM: digits correct per minute

R-CBM

Oral reading fluency: words read correctly per minute

“Because reading instruction focused on connected text generally does not begin until first grade, we expected different growth patterns for R-CBM among students who began the study in kindergarten than those who began the study in Grades 1 or 2.”

Range: (Spring 2013) 0 - 229

Variable:

  • ORFPM: words reading correct per minute

Other Variables of Interest

  • StudentRaceS: string, student race ((1) Asian (2) Native Hawaiian or Pacific Islander (3) Black or African American (4) American Indian or Alaska Native (5) Caucasian or White (6) More than one Race (7) Hispanic)
  • Sex: binary 0 boy/1 girl, student’s gender
  • StudentGrade: student’s grade
  • Stud_FRLunchS: whether student is in the free lunch program

(Summary statistics for these variables are in exploratory_analysis.Rmd.)

Results

Data preparation

The data-cleaning code is in data_wrangling.Rmd under the analysis folder, and the cleaned dataset is data_average.csv under the data folder. Due to space limits, I will only describe the wrangling process here (a condensed sketch of the first three steps follows the list):

  1. Use the foreign package to convert the .sav file to a .csv file. They are stored as data.sav (only locally, due to its large size) and data.csv in the data folder. This is a wide dataset with one column per test item per time point.

  2. Select desired variables, and pivot the table longer.

  3. Remove an outlier that likely resulted from a data entry error, and reverse the scores for reverse-coded test items.

  4. Then, compute the aggregate score for each measure, and, for each model type, create a time variable with the coding method described in the paper.
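
Since the full code lives in data_wrangling.Rmd, here is a condensed, hypothetical sketch of steps 1-3; the wide-format column names and the selected variables are illustrative, while the reversed SDQ items and the 0-2 scale come from the codebook above:

library(foreign)
library(tidyverse)
library(here)

# Step 1: convert the SPSS file to csv (file names as described above)
data_raw <- read.spss(here("data", "data.sav"), to.data.frame = TRUE)
write_csv(data_raw, here("data", "data.csv"))

# Step 2: keep identifiers and outcomes, then pivot the per-wave item columns
# longer ("T1_SDQ1"-style names are illustrative, not the real column names)
data_long <- data_raw %>%
  select(StudentID, SchoolID, Condition, matches("^T[1-4]_")) %>%
  pivot_longer(matches("^T[1-4]_"),
               names_to = c("time", "item"), names_sep = "_",
               values_to = "score")

# Step 3: reverse-score the reverse-coded SDQ items (0-2 scale, so 2 - x)
reversed_sdq <- c("SDQ7", "SDQ11", "SDQ14", "SDQ21", "SDQ25")
data_long <- data_long %>%
  mutate(score = if_else(item %in% reversed_sdq, 2 - score, score))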

Below is the cleaned dataset:

data_average <- read_csv(here("data", "data_average.csv"))
head(data_average)
## # A tibble: 6 x 29
##   StudentID SchoolID Condition   Site  District StudentGender StudentRaceS
##       <dbl>    <dbl> <chr>       <chr> <chr>    <chr>         <chr>       
## 1 101101125      101 Early Start UW    Renton   Male          (1) Asian   
## 2 101101125      101 Early Start UW    Renton   Male          (1) Asian   
## 3 101101125      101 Early Start UW    Renton   Male          (1) Asian   
## 4 101101125      101 Early Start UW    Renton   Male          (1) Asian   
## 5 101101189      101 Early Start UW    Renton   Female        (7) Hispanic
## 6 101101189      101 Early Start UW    Renton   Female        (7) Hispanic
## # ... with 22 more variables: Stud_FRLunchS <dbl>, StudentGrade <chr>,
## #   time <dbl>, time_A <dbl>, time_B <dbl>, time_C <dbl>, time_D <dbl>,
## #   time_E <dbl>, orfpm <dbl>, mathdcm <dbl>, db <dbl>, aet <dbl>,
## #   sdq_emotion <dbl>, sdq_social <dbl>, sdq_conduct <dbl>, sdq_hyper <dbl>,
## #   sdq_peer <dbl>, dessa_composite <dbl>, dessa_solve <dbl>,
## #   dessa_emotion <dbl>, dessa_empathy <dbl>, dessa_skill <dbl>

Key analysis

The key analysis is to find the best-fitting linear mixed-effects model for each outcome variable out of eight possible options. As shown in Figure 1, Model A assumes the same growth from each time point to the next; B predicts growth only during the school year, with scores remaining flat from the end of spring to the start of fall (T2 to T3); C assumes growth during Year 1 (T1 to T2) that stops afterward; D assumes growth during the school year but a drop in scores over the summer (T2 to T3); finally, E adds a quadratic term to A.

Model Set 1 and Model Set 2 differ in how the control group's performance is estimated. Model Set 1 assumes that the control group's growth can be predicted linearly, while Model Set 2 uses the same piecewise slope for the control group as for the intervention group.

Figure 1. Model demonstration from the paper.
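
The time variables created in step 4 of the wrangling encode these shapes. The exact values below are my reading of Figure 1 rather than the paper's published coding, so treat them as assumptions (time stores the wave number 1-4; time_D is omitted because I am less sure how the summer drop is coded):

# Hypothetical time codings inferred from Figure 1
data_average <- data_average %>%
  mutate(time_A = time - 1,             # A: equal growth at every interval
         time_B = c(0, 1, 1, 2)[time],  # B: school-year growth, flat over summer
         time_C = c(0, 1, 1, 1)[time],  # C: growth during Year 1 only
         time_E = time - 1)             # E: linear term; quadratic added in model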

Since the authors did not make explicit in the paper which fixed and random effects they included, I tried several versions using the lme4 package (written below in lme4 syntax, with score standing in for each outcome):

  1. score ~ Condition * time + (time | SchoolID) + (1 | StudentID)

Fixed effects of condition, time, and their interaction; random slope of time for each school; random intercept for each student; students and schools are crossed (I think?).

  2. score ~ Condition * time + (time | SchoolID/StudentID)

Fixed effects of condition, time, and their interaction; random slopes of time at the school and student-within-school levels; students are nested within schools.

  3. score ~ Condition * time + (1 | SchoolID/StudentID)

Fixed effects of condition, time, and their interaction; random intercepts at the school and student-within-school levels; students are nested within schools.

When modeling E2, a quadratic Time term is added as a fixed effect.

The last formula produces results closest to the paper's and the fewest convergence issues, so I use it when fitting the models.

To handle missing data, I follow the paper's method and use maximum likelihood estimation. However, I cannot use auxiliary variables in the process, as the authors did not mention the exact auxiliary variables used to estimate each dependent variable.
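
Since run_lmer() is my own wrapper defined elsewhere in the project code, here is a minimal sketch of what it does, assuming the third formula above; fitting with REML = FALSE (maximum likelihood) keeps the \(AICc\) values comparable across models:

library(lme4)

# Build formula 3 for a given outcome and time coding, then fit by maximum
# likelihood (REML = FALSE); the E2 quadratic term is handled separately
run_lmer <- function(time_var, test, data) {
  f <- as.formula(paste0(test, " ~ Condition * ", time_var,
                         " + (1 | SchoolID/StudentID)"))
  lmer(f, data = data, REML = FALSE)
}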

After specifying the model-fitting function, I use a for loop to fit each model. In each iteration, I also use a function from AICcmodavg to calculate \(AICc\). Below is the code for estimating Model Set 2.

time_vars_2 <- c("time_B", "time_C", "time_D", "time_E")
tests <- colnames(data_average)[17:29]
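# Results container for every model x outcome combination; coef_list will hold
# each model's fixed-effect estimates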
mod_result <- as.data.frame(matrix(NA, 140, 5)) %>%
  rename(model_num = V1,
         model_set = V2,
         test_question = V3,
         aicc_value = V4,
         coef_list = V5)

n <- 0

for (time_var in time_vars_2) {

  for (test in tests) {
    n <- n + 1
    mod <- run_lmer(time_var, test, data_average)
    aicc <- AICcmodavg::AICc(mod, return.K = FALSE, second.ord = TRUE)

    mod_result$model_num[n] <- time_var
    mod_result$model_set[n] <- 2
    mod_result$test_question[n] <- test
    mod_result$aicc_value[n] <- aicc
    mod_result$coef_list[n] <- list(unname(fixef(mod)))
  }
}

As a side note, the authors modeled reading scores separately for students who began the study in kindergarten versus in Grades 1 or 2, as reading instruction on connected text is not introduced until first grade. Therefore, I split the data into these two groups and estimate their reading models separately.

Then, using the following code, I calculate \(\Delta AICc\) for each model by finding the minimum \(AICc\) among the eight models for each outcome and subtracting it from each model's \(AICc\).
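
(In formula form, for model \(i\) on a given outcome: \(\Delta AICc_i = AICc_i - \min_j AICc_j\), where \(j\) ranges over the eight candidate models.)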

mod_result <- mod_result %>%
  group_by(test_question) %>%
  mutate(min_value = min(aicc_value)) %>%
  ungroup() %>%
  mutate(diff = aicc_value - min_value) 

The model comparison is demonstrated in the following data frame, in which my \(\Delta AICc\) values are listed under the diff column, and the results from the original paper are in paper_diff.

mod_result_compare <- read_csv(here("data","mod_result_compare.csv"))
## 
## -- Column specification --------------------------------------------------------
## cols(
##   model_letter = col_character(),
##   model_set = col_double(),
##   test_question = col_character(),
##   aicc_value = col_double(),
##   min_value = col_double(),
##   diff = col_double(),
##   model_name = col_character(),
##   paper_diff = col_double()
## )
head(mod_result_compare)
## # A tibble: 6 x 8
##   model_letter model_set test_question aicc_value min_value  diff model_name    
##   <chr>            <dbl> <chr>              <dbl>     <dbl> <dbl> <chr>         
## 1 B                    2 mathdcm           80502.    80502.  0    mathdcm_B2    
## 2 B                    2 db               205032.   205029.  2.42 db_B2         
## 3 B                    2 aet              221982.   221960. 21.0  aet_B2        
## 4 B                    2 sdq_emotion       95313.    95310.  2.92 sdq_emotion_B2
## 5 B                    2 sdq_social       103794.   103714. 80.7  sdq_social_B2 
## 6 B                    2 sdq_conduct       86574.    86570.  3.66 sdq_conduct_B2
## # ... with 1 more variable: paper_diff <dbl>

To further compare my replication with the original study findings, I create a scatter plot for each dependent variable, where each dot represents a model. The x-axis is the \(\Delta AICc\) value of that model from my reproduction, and the y-axis is the value reported in the paper.

make_aicc_plot <- function(test) {
  data <- mod_result_compare %>%
    filter(test_question == test) %>%
    mutate(model_name_short = paste0(model_letter, model_set))
  ggplot(data, aes(x = diff, y = paper_diff, color = model_name_short,
                   label = model_name_short)) +
    geom_point() +
    coord_fixed(ratio = 1) +
    geom_abline(slope = 1) +
    labs(title = test,
         x = "replication AICc diff",
         y = "paper AICc diff") +
    geom_text(hjust = 0, vjust = 0) +
    theme_classic() +
    theme(legend.position = "none")
}
# par(mfrow) does not affect ggplot objects, so each plot prints on its own
for (i in unique(mod_result_compare$test_question)) {
  print(make_aicc_plot(i))
}

The 45-degree line shows where the points would lie if my replication findings were exactly the same as the paper's. If a point is at the origin, the paper and my replication agree on the best-fit model for that dependent variable. In sum, I am able to replicate 12 out of the 15 model selection decisions.

Some trends can be spotted in these scatter plots. First, academic engaged time (aet) and disruptive behavior (db) produce the most divergent \(\Delta AICc\) values, especially for model D2; both happen to be observational measures collected in the same settings. For the academic tests, math (mathdcm) and reading (orfpm_E and orfpm_K), models such as C2 and D2 deviate greatly from the 45-degree line. Additionally, some SDQ and DESSA measures, such as SDQ conduct problems and DESSA empathy, show systematically greater \(\Delta AICc\) values than the paper's.

Exploratory analyses

I continue by plotting the growth curve predictions implied by the coefficients from my reproduction and from the paper. Below is the set of plots for models that both my results and the paper's agree are the best fits.

plot_growth_df <- read_csv(here("data", "plot_growth_df.csv"))
colors <- c("my" = "black", "paper" = "red")
# "actual" = "green"
plot_growth <- function(mod_name) {
  plot <- ggplot(plot_growth_df %>% filter(model_name == mod_name),
                 aes(x = time_point)) +
    geom_line(aes(group = 1, y = my_early, color = "my", linetype = "early"), size = 0.8) +
    geom_line(aes(group = 1, y = my_delayed, color = "my", linetype = "delayed"), size = 0.8) +
    geom_line(aes(group = 1, y = paper_early, color = "paper", linetype = "early"), size = 0.8) +
    geom_line(aes(group = 1, y = paper_delayed, color = "paper", linetype = "delayed"), size = 0.8) +
    theme_bw() +
    labs(x = "Time Point",
         y = "Score",
         title = mod_name) +
    geom_point(aes(y = my_early), color = "black") +
    geom_point(aes(y = my_delayed), color = "black") +
    geom_point(aes(y = paper_early), color = "red") +
    geom_point(aes(y = paper_delayed), color = "red") +
    scale_color_manual(values = colors)
  print(plot)
}
correct_model_intercept <- plot_growth_df %>%
  filter(agreement == "both")
for (i in unique(correct_model_intercept$model_name)) {
  plot_growth(i)
}

These plots show that the replication results are very similar to the original study's: the black lines (replication) and the red lines (paper) lie close to each other.

Figure 2. Growth curve predictions for disruptive behavior and actual mean scores. (Code available in main_analysis.Rmd.)

However, when different models are chosen as the best, this alignment disappears. For example, the curves from the original study and my reproduction are drastically different for disruptive behavior (see Figure 2): the paper chooses B2, while my analysis chooses C2 as the best model. After adding the mean scores for both conditions at every time point to the plot, however, I notice that neither B2 nor C2 predicts disruptive behavior well. The mean scores for the intervention group (Early Start) show a somewhat D2-like pattern, implying that disruptive behavior in the classroom increased for children in the SEL curriculum after the summer but declined again by the end of the second year. On the other hand, disruptive behavior in the control group peaked at the end of the first year and declined as students grew older. None of the eight models proposed by the authors could capture both patterns at the same time.

Besides these main analyses, I have also run descriptive analyses on the dataset. First, I compute the mean, standard deviation, sample size, and missing-data rate for each test at each time point, separated by condition. A direct comparison with the original paper's statistics (in the supplemental material) shows that the sample size is slightly smaller in my cleaned dataset, implying that the authors removed a few children for various reasons.

data_average %>%
  pivot_longer(cols = orfpm:dessa_skill,
               names_to = "test_question",
               values_to = "score") %>%
  group_by(test_question, Condition, time) %>%
  summarize(n = sum(!is.na(score)),
            mean = mean(score, na.rm = TRUE),
            sd = sd(score, na.rm = TRUE),
            missing_rate = mean(is.na(score))) %>%
  kable(caption = "Table 1. Descriptive statistics for each test, separated by condition and time.") %>%
  collapse_rows(columns = 1, valign = "top")
## `summarise()` has grouped output by 'test_question', 'Condition'. You can override using the `.groups` argument.
Table 1. Descriptive statistics for each test, separated by condition and time.
test_question Condition time n mean sd missing_rate
aet Delayed Start 1 3268 83.341799 20.186057 0.2773109
Delayed Start 2 2999 80.087918 22.457127 0.3367979
Delayed Start 3 2983 80.327348 21.266099 0.3403361
Delayed Start 4 2973 79.986882 22.417906 0.3425475
Early Start 1 3322 81.903198 20.672618 0.2798613
Early Start 2 3079 79.739821 22.567112 0.3325385
Early Start 3 3167 81.438637 20.833767 0.3134620
Early Start 4 3010 81.154214 21.486995 0.3474962
db Delayed Start 1 3269 8.815753 14.370629 0.2770898
Delayed Start 2 2998 9.602048 16.502955 0.3370190
Delayed Start 3 2983 9.280352 15.517967 0.3403361
Delayed Start 4 2973 9.078260 15.819513 0.3425475
Early Start 1 3330 9.531577 15.500510 0.2781270
Early Start 2 3079 8.607429 14.647855 0.3325385
Early Start 3 3160 9.116886 15.402172 0.3149794
Early Start 4 3010 8.755512 15.304671 0.3474962
dessa_composite Delayed Start 1 3064 94.243799 26.136042 0.3224237
Delayed Start 2 2738 101.794010 26.069872 0.3945157
Delayed Start 3 2803 99.496611 25.153311 0.3801415
Delayed Start 4 2880 101.717361 25.730278 0.3631137
Early Start 1 3321 96.097260 25.570076 0.2800780
Early Start 2 2864 106.480098 25.097051 0.3791459
Early Start 3 2940 100.432993 24.907968 0.3626707
Early Start 4 2820 105.217021 25.748608 0.3886842
dessa_emotion Delayed Start 1 3085 23.474878 6.543759 0.3177797
Delayed Start 2 2742 25.247629 6.662086 0.3936311
Delayed Start 3 2806 24.527798 6.373836 0.3794781
Delayed Start 4 2882 25.268217 6.618730 0.3626714
Early Start 1 3324 24.115824 6.347790 0.2794277
Early Start 2 2871 26.604319 6.342559 0.3776284
Early Start 3 2960 25.012500 6.172154 0.3583351
Early Start 4 2834 26.420607 6.489140 0.3856493
dessa_empathy Delayed Start 1 3141 23.154091 7.302853 0.3053958
Delayed Start 2 2758 25.311095 6.992277 0.3900929
Delayed Start 3 2819 24.826179 6.779997 0.3766033
Delayed Start 4 2896 25.292818 6.940649 0.3595754
Early Start 1 3342 23.600239 6.995229 0.2755257
Early Start 2 2893 26.537850 6.761986 0.3728593
Early Start 3 2999 24.853951 6.796234 0.3498808
Early Start 4 2878 26.177206 6.847655 0.3761110
dessa_skill Delayed Start 1 3143 24.488705 7.331544 0.3049536
Delayed Start 2 2757 26.103373 7.313614 0.3903140
Delayed Start 3 2823 25.413390 7.194734 0.3757187
Delayed Start 4 2898 25.946860 7.211290 0.3591331
Early Start 1 3360 24.579167 7.159912 0.2716237
Early Start 2 2914 26.947495 6.886231 0.3683070
Early Start 3 3008 25.500332 7.187775 0.3479298
Early Start 4 2878 26.472203 7.240716 0.3761110
dessa_solve Delayed Start 1 3134 23.317167 6.927621 0.3069438
Delayed Start 2 2759 25.179413 6.910864 0.3898717
Delayed Start 3 2822 24.710135 6.661430 0.3759398
Delayed Start 4 2896 25.169544 6.816665 0.3595754
Early Start 1 3342 23.766906 6.723500 0.2755257
Early Start 2 2882 26.297710 6.607101 0.3752439
Early Start 3 2988 25.026774 6.585226 0.3522653
Early Start 4 2863 26.028292 6.868453 0.3793627
mathdcm Delayed Start 1 3275 1.554898 1.437160 0.2757629
Delayed Start 2 3077 3.173817 1.617941 0.3195489
Delayed Start 3 3045 3.335304 1.465916 0.3266254
Delayed Start 4 2921 4.658465 1.452153 0.3540469
Early Start 1 3361 1.486989 1.454646 0.2714069
Early Start 2 3145 3.123084 1.694985 0.3182311
Early Start 3 3161 3.299510 1.466015 0.3147626
Early Start 4 2983 4.631830 1.394130 0.3533492
orfpm Delayed Start 1 3265 24.303216 33.398587 0.2779743
Delayed Start 2 3029 48.397491 42.280435 0.3301636
Delayed Start 3 3035 60.635255 44.578511 0.3288368
Delayed Start 4 2934 83.632243 41.825960 0.3511720
Early Start 1 3326 22.722790 33.739874 0.2789941
Early Start 2 3110 48.604502 45.104566 0.3258183
Early Start 3 3145 57.881399 45.667438 0.3182311
Early Start 4 2971 82.048132 42.549136 0.3559506
sdq_conduct Delayed Start 1 3148 1.050508 1.739188 0.3038479
Delayed Start 2 2764 1.096961 1.806612 0.3887660
Delayed Start 3 2845 1.018278 1.720139 0.3708536
Delayed Start 4 2896 1.245856 1.966116 0.3595754
Early Start 1 3361 1.066052 1.675039 0.2714069
Early Start 2 2896 1.068370 1.723601 0.3722090
Early Start 3 3012 1.014940 1.653105 0.3470626
Early Start 4 2883 1.116545 1.818904 0.3750271
sdq_emotion Delayed Start 1 3148 1.152160 1.875377 0.3038479
Delayed Start 2 2765 1.321881 1.925487 0.3885449
Delayed Start 3 2849 1.374868 1.968827 0.3699690
Delayed Start 4 2901 1.574285 2.093277 0.3584697
Early Start 1 3352 1.249403 1.916295 0.2733579
Early Start 2 2906 1.213696 1.853897 0.3700412
Early Start 3 3017 1.293669 1.904884 0.3459788
Early Start 4 2884 1.362344 1.981562 0.3748103
sdq_hyper Delayed Start 1 3148 3.358958 3.121030 0.3038479
Delayed Start 2 2764 3.241317 3.125904 0.3887660
Delayed Start 3 2849 3.457353 3.133111 0.3699690
Delayed Start 4 2900 3.428965 3.168291 0.3586908
Early Start 1 3370 3.529080 3.096913 0.2694559
Early Start 2 2914 3.109128 3.002305 0.3683070
Early Start 3 3017 3.458071 3.124778 0.3459788
Early Start 4 2885 3.302253 3.133885 0.3745935
sdq_peer Delayed Start 1 3143 1.413618 1.675477 0.3049536
Delayed Start 2 2762 1.247647 1.643598 0.3892083
Delayed Start 3 2843 1.292649 1.653959 0.3712959
Delayed Start 4 2899 1.322180 1.710912 0.3589120
Early Start 1 3340 1.457784 1.657843 0.2759592
Early Start 2 2885 1.177816 1.569891 0.3745935
Early Start 3 3007 1.389425 1.611975 0.3481465
Early Start 4 2883 1.312522 1.637034 0.3750271
sdq_social Delayed Start 1 3121 7.012496 2.508735 0.3098187
Delayed Start 2 2762 7.657495 2.369979 0.3892083
Delayed Start 3 2814 7.582445 2.358274 0.3777090
Delayed Start 4 2893 7.690633 2.393955 0.3602388
Early Start 1 3330 7.144444 2.483682 0.2781270
Early Start 2 2880 7.739236 2.336057 0.3756774
Early Start 3 3003 7.432234 2.365993 0.3490137
Early Start 4 2865 7.669110 2.356223 0.3789291

I also replicated the correlation plot from the paper. (Note: the reading score orfpm is removed from the plot because the paper used data filtered by student grade, which is hard to do here.)
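
The plotting code lives in exploratory_analysis.Rmd; the underlying matrix can be computed along these lines (a minimal sketch, assuming the cleaned dataset above; pairwise-complete observations handle the missing scores):

# Correlations among the outcome measures at Time 1 (orfpm excluded, as noted)
cor_t1 <- data_average %>%
  filter(time == 1) %>%
  select(mathdcm:dessa_skill) %>%
  cor(use = "pairwise.complete.obs")
round(cor_t1, 2)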

Figure 3. Correlation plot at Time 1. (Code available in exploratory_analysis.Rmd.)

Figure 4. Correlation plot at Time 4. (Code available in exploratory_analysis.Rmd.)

Table 2. Correlation plot from the paper. Upper right: Time 1; bottom left: Time 4.

The correlation coefficients are not exactly the same, but they fall in similar ranges. DESSA measures have strong intercorrelations, while the correlations within SDQ are a little weaker. Classroom behavior measures (db, aet) and the math score (mathdcm) are only weakly correlated with the other variables most of the time.

Discussion

Summary of Reproduction Attempt

In sum, I am able to replicate 12 out of the 15 model selection decisions, so I partially reproduce the results from the paper. However, since the paper and I disagree on the best-fitting model for three outcome variables, and some \(\Delta AICc\) values vary drastically even when the selected best-fit model matches, I suspect there are systematic errors underneath my analysis that I have not been able to spot. Possible reasons for the variation include: 1) the fixed and random effects may not be specified the way the paper specifies them; 2) omitting the auxiliary variables in the FIML estimation could cause differences; 3) the outlier-removal decisions are unclear, so we might have been using slightly different datasets.

Commentary

Here are some of the things I noticed and contemplated while working on this project:

  1. Interestingly, the authors in fact collected data from five time points! They measured everything again in Year 3. I am interested in how the results could have changed if the models were estimated with five waves, and how many more possible models could be proposed.

  2. The above discussion connects to this concern: with so many assumptions built into the models, do they suffer from overfitting? If the dataset were extended to 10 years, would the score fluctuations that these models capture actually matter?

  3. One question that bugs me is the implication of this research. At this point, I can say with great confidence that a simple linear growth model assuming the same slope over time does not work well. However, I find it hard to justify why the developmental trajectory of a given measure should be estimated by, for instance, D2 but not C2, and I doubt whether these results would replicate with a different sample.

  4. Due to time constraints, I was not able to reproduce the main effects of the SEL curriculum. But looking at the paper's conclusions, with an effect size of 0.10, I am not sure the SEL intervention is truly effective. I wonder if it could have even longer-term benefits that cannot be captured over the course of two years.

  5. I am grateful that I have received a lot of help from the authors. I especially appreciate the long email from Keith that answered my questions with so many details.

  6. And finally, thank you, the PSYCH 251 teaching team!! I am probably one of the few people, if not the only one, without a lab or PI to rely on, and I could not have completed this project without all of your help. I want to thank Julie for the exhaustive feedback, Sarah for helping me with debugging, Effie for answering my questions during OH, and Mike for those extended office hours to help me with everything!!

This project is written in R, with the help of the following packages: tidyverse, knitr, lme4, here, kableExtra, readr, foreign, AICcmodavg.