Introduction

Social-emotional learning (SEL) programs have recently gained increasing attention. Therefore, it is vital to evaluate the program’s efficacy in promoting social-emotional and academic skills. Low et al. (2019) utilized a two-year dataset from Second Steps that recorded behavioral reports and observations on over 9,000 students from 61 schools and 7 districts.

The longitudinal data allowed the researchers to investigate the type of growth curve models that could explain the developmental trajectories the best. The study found out that in most measurements, students’ performance only improved during school, not summer. Thus, the “summer slide” phenomenon was evident, and a linear growth model was a poor fit.

The researchers continued to use these models to examine the efficacy of the Second Steps program. Students from the experimental group outperformed those in the control group in emotional symptoms, social-emotional scores, skills for learning, and emotional management. However, the program did not show to improve students’ academic performance, academic engagement, and disruptive behavior, and most effect sizes of the results were relatively small.

The current project aims to reproduce the main results from this paper: 1) find the optimal growth curve model for each measurement; and 2) evaluate the program’s efficacy in SEL and academic measures. If possible, I would also like to test the model fit via other measures, though it is outside the scope of my current statistical knowledge.

Justification for choice of study

The linear growth curve model is widely used to estimate student’s performance over time. If it does not accurately describe the pattern like Low et al. suggested (2019), further longitudinal research on student program outcomes should pursue other models.

Additionally, the paper only used one method to measure the model fit. If more measures are available, the results will be more robust.

Anticipated challenges

First, the original data analysis was conducted via SAS, and changing to R could induce discrepancies in results.

Secondly, I have not tried growth curve models, and multilevel analysis was still a little new to me, so I need help from either the class or the authors.

Thirdly, I have not encountered the implementation used to account for missing data in the paper, so I need to learn more about this method and its implications.

Links

Project repository (on Github): https://github.com/psych251/low2019.git

Original paper (as hosted in your repo): https://github.com/psych251/low2019/blob/cbde9963c81eae601a1597c40d561b6119235435/original_paper/low2019.pdf

Methods

Description of the steps required to reproduce the results

(I haven’t received the data, so these are all conjectures)

First, I will review the datasets and the codebook, and remove irrelevant variables.
Then, I will use FIML estimation to to impute missing data. If time allows, I will also reproduce the procedure of finding the variables that correlate with missingness.
Next, I will conduct a round of exploratory analysis, such as finding the number of students involved in each group and average change in scores over time.
The next part is to reproduce the main result – finding the fit for eight growth curve models. I will first specify the parameters of each model, use R packages (maybe lavaan?) to estimate the models, and find the best fit with \(\Delta AICc\) and Akaike weights, as the paper suggested. Potentially I could use other methods to compare the fit of the models, to ensure the robustness of the result.

If time allows, I will also reproduce the main effect of the intervention, using the best fit models from (4).

Differences from original study

The authors used SAS to run the analyses. Since I will use R, I expect some discrepencies in results.
The first author mentioned to me that they made nuanced decisions about missing data, imputations and outliers that might not be explicit in the paper.

I expect that these two differences will make it hard to reproduce the exact values in the results. However, if the results are robust enough, I should be able to come to the same conclusions (i.e. what model is the best fit for each DV?), which is my goal of this project.

Project Progress Check 1

Measure of success

Identify all relevant variables.
Recalculate the reliability of key outcome measures. (cronbach’s alpha should be identical to the numbers reported in the paper)
Missing data should be adequately imputed. Question: if missingness is systematic (e.g. one teacher didn’t fill out assessments for all of their students), is it ok to impute missing data, or should the student data be dropped? (Either using the author’s method, or finding an alternative; I can’t find a way to measure whether it’s successful)
The models should be constructed correctly with all nested variables specified. (Coefficients should be similar to the paper’s results)

Pipeline progress

I am still at Step 1. The dataset is unexpectedly huge (1300+ columns!!), the analysis script is unavailable, and the codebook is not clear enough. I have to figure out a lot of things by myself. Below is my progress of summarizing variables of interest.

Timeline

Year 1: Fall 2012, Spring 2013

Year 2: Fall 2013, Spring 2014

Identifier

Variables:

StudentID: 8-digit number
TeacherID: 6-digit number
SchoolID: 3-digit number
District: string names
Site: string, ASU or UW

Each variable is nested within the later variable.

The condition (Delayed Start vs. Early Start) is manipulated on the school level.

Condition

Variable:

Condition: string, DELAYED START or EARLY START

Outcome

DSSA

36 items, teacher-reported rating scale of student’s social emotional competencies

Range: 0 - 4

Subscales:

skills for learning (1,6,7,9,12,14,16,20,23)
empathy (5,8,17,18,22,26,29,34,35)
emotion management (2,3,15,21,25,27,28,31,26)
problem solving (4,10,11,13,19,24,30,32,33)
social-emotional composite (aggregate)

Variables:

DESSA1 - DESSA36: individual item score

SDQ

25 items, teacher-reported scale of student’s behavior

Range: 0 (not true) - 2 (certainly true)

Subscales:

peer problems (6, 11r, 14r, 19, 23)
hyperactivity (2, 10, 15, 21r, 25r)
conduct problems (5, 7r, 12, 18, 22)
prosocial (1, 4, 9, 17, 20)
emotional symptoms (3, 8, 13, 16, 24)

Variables:

SDQ1 - SDQ25: individual item score

Academic Engagement Time

Percent intervals of academic engagement time, from direct observation

Range: 0 - 1

Variable:

Disruptive Behavior

Percent intervals of disruptive behavior, from direct observation

Range: 0 - 1

M-CBM

Percentage correct on M-CBM math assessment

no granular data

Variable:

MATHPC: total problems correct in M-CBM
MATHPCT: percent problems correct in M-CBM
MATHDC: count of digits correct in M-CBM
MATHDCM: digits correct per minute

R-CBM

Oral Reading Fluency Words Reading Correct Per Minute

“Because reading instruction focused on connected text generally does not begin until first grade, we expected different growth patterns for R-CBM among students who began the study in kindergarten than those who began the study in Grades 1 or 2.”

Range: (Spring 2013) 0 - 229

Variable:

ORFPM: words reading correct per minute

Other Variables of Interest

StudentRaceS: string, student race ((1) Asian (2) Native Hawaiian or Pacific Islander (3) Black or African American(4) American Indian or Alaska Native (5) Caucasian or White (6) More than one Race (7) Hispanic)
Sex: binary 0 boy/1 girl, student’s gender
StudentGrade: student’s grade
StudentAbsences: number of school days missed for a student
StudentSuspensions: number of suspensions for a student
PCMTOTAL: teacher proactive classroom management score
TSESTS: teacher self efficacy total score
SchoolPctFRL: percentage of students with free and reduced lunch in a school

Questions I have now:

Some students were taught by different teachers in Year 1 and Year 2. What should I do??
Sometimes, students’ grade didn’t increment by 1 as they went from Year 1 to Year 2.
Many variables that will be interesting to look at have a lot of missing values (about 1/3). Is it not a good idea to use them?
How should the dataframe be structured? I assume that each student should have four rows, one for each time point. Each row contains public variables like studentID, gender, etc, and student’s outcomes for each time point.

Project Progress Check 2

First, I encountered problems when checking the item reliabilities using the psych package. Mike suggested the agreement package, but there isn’t enough explanation about this package for me to implement. For example, I’m not sure which model to use in the dim_icc function. Also, this function seems to calculate inter-rater reliability, and I’m not sure if it can be translated directly to a longitudinal dataset.

Then, I tried to work on fitting the models. I found out the package lavaan has an argument about how to deal with missing data, and full information maximum likelihood estimation is one of the methods. However, I still have the following questions:

I can’t find the way to specify auxillary variables for FIML. Though the author mentioned that there isn’t a lot of difference in adding the auxillary variables vs. not, I am still curious if there’s a way I could implement it in R.
FIML is embedded in a modeling function in lavaan. For example, I used the growth function below, but I don’t know if that’s the correct function I should use. The best way is to retrieve the modified dataset after FIML estimation, but to my knowledge lavaan can’t do this.
I fit a model below, but I need more help interpreting what each coefficient and variable mean. I’d also like to know how to manipulate this model to achieve diferent piecewise models described in the paper. I also need help understanding the output.
Finally, I’ve found a package that can calculate AICc for me (MuMIn), and it works with a lavaan object.

Results

Data preparation

Data preparation following the analysis plan.

library(here)

## Warning: package 'here' was built under R version 4.0.4

## here() starts at C:/Users/iris_/OneDrive - Stanford/Stanford/Autumn 2021/PSYCH 251/low2019

data <- read_csv(here("data", "data_reliability.csv")) %>%
  mutate(StudentID = as.character(StudentID))

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   Condition = col_character(),
##   Site = col_character(),
##   District = col_character(),
##   StudentGender = col_character(),
##   StudentRaceS = col_character(),
##   StudentGrade = col_character(),
##   DESSALowHalf = col_character(),
##   DESSALowHalfY2 = col_character(),
##   DESSAsecSignif = col_character(),
##   DESSAslSignif = col_character(),
##   DESSAsaeSignif = col_character(),
##   DESSAemSignif = col_character(),
##   DESSApsSignif = col_character(),
##   SDQesSignif = col_character(),
##   SDQcpSignif = col_character(),
##   SDQhsSignif = col_character(),
##   SDQppSignif = col_character(),
##   SDQpsSignif = col_character(),
##   ODBSignif = col_character()
## )
## i Use `spec()` for the full column specifications.

dessa_wide <- data %>%
  dplyr::select(StudentID, SchoolID, Condition, ADESSA1:ADESSA36, BDESSA1:BDESSA36, 
                CDESSA1:CDESSA36, DDESSA1:DDESSA36)

dessa <- dessa_wide %>%
  pivot_longer(cols = -c(StudentID, SchoolID, Condition), names_pattern = "^(.)(.*)$",
               names_to = c("time", "item")) %>%
  pivot_wider(id_cols = c(StudentID, SchoolID, Condition, time), names_from = item, 
              values_from = value, names_repair = "check_unique")

dessa_skill <- dessa %>%
  dplyr::select(StudentID, SchoolID, Condition, time, DESSA1, DESSA6, DESSA7,
                DESSA9, DESSA12, DESSA14, DESSA16, DESSA20, DESSA23, 
                row.names = NULL) %>%
  mutate(time = as.factor(case_when(time == "A" ~ 1,
                          time == "B" ~ 2,
                          time == "C" ~ 3,
                          time == "D" ~ 4)))

dessa_skill_avg <- dessa_skill %>%
  pivot_longer(cols = -c(StudentID, SchoolID, Condition, time),
               names_to = c("item")) %>%
  group_by(StudentID, SchoolID, Condition, time) %>%
  summarize(dessa_skill_score = mean(value, na.rm = T))

## `summarise()` regrouping output by 'StudentID', 'SchoolID', 'Condition' (override with `.groups` argument)

dessa_skill_avg_wide <- dessa_skill_avg %>%
  pivot_wider(names_from = time, names_prefix = "t", 
              values_from = dessa_skill_score)

Plotting and exploration

ggplot(dessa_skill_avg, 
       aes(x = time, y = dessa_skill_score, col = Condition)) + 
  geom_line(aes(group = StudentID), 
            alpha = .05, width = .25) + 
  stat_summary(aes(group = Condition), geom = "line") +
  theme_bw()

## Warning: Ignoring unknown parameters: width

## Warning: Removed 12660 rows containing non-finite values (stat_summary).

## No summary function supplied, defaulting to `mean_se()`

## Warning: Removed 12120 row(s) containing missing values (geom_path).

  # geom_smooth(aes(group = Condition), method = "lm")

Mike prototype in lmer

For students within intervention schools, we coded time as (A) 0, 1, 2, 3 across time for linear slopes; (B) 0, 1, 1, 2 for within-year, stepped growth; (C) 0, 1, 1, 1 for initial gains with stable scores thereafter; and (D) 0, 1, 0, 1 for gains during the school years and a decline between years.

dessa_skill_avg$time_A <- as.numeric(dessa_skill_avg$time)-1
dessa_skill_avg$time_B <- dessa_skill_avg$time_A
dessa_skill_avg$time_B[dessa_skill_avg$time_A > 1] <-
  dessa_skill_avg$time_A[dessa_skill_avg$time_A > 1] - 1
dessa_skill_avg$time_C <- dessa_skill_avg$time_B
dessa_skill_avg$time_C[dessa_skill_avg$time_B > 1] <- 1
dessa_skill_avg$time_D <- dessa_skill_avg$time_A
dessa_skill_avg$time_D[dessa_skill_avg$time_A > 1] <- 
  dessa_skill_avg$time_D[dessa_skill_avg$time_A > 1] - 2


mod_A <- lmer(dessa_skill_score ~ time_A + 
              (time_A | SchoolID) + (1 | StudentID), 
            data = dessa_skill_avg)

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00495663 (tol = 0.002, component 1)

mod_B <- lmer(dessa_skill_score ~ time_B + 
              (time_B | SchoolID) + (1 | StudentID), 
            data = dessa_skill_avg)
mod_C <- lmer(dessa_skill_score ~ time_C + 
              (time_C | SchoolID) + (1 | StudentID), 
            data = dessa_skill_avg)
mod_D <- lmer(dessa_skill_score ~ time_D + 
              (time_D | SchoolID) + (1 | StudentID), 
            data = dessa_skill_avg)

AIC(mod_A)

## [1] 47812.27

Functionalize!

run_lmer <- function(time_var) {
  mod_formula <- paste("dessa_skill_score ~",time_var,"+ 
    (", time_var, "| SchoolID) + (1 | StudentID)")
  mod <- lmer(mod_formula, 
              data = dessa_skill_avg) 
  return(mod)
}

sim <- tibble(time_vars = c("time_A","time_B","time_C","time_D")) %>%
  mutate(mod = map(time_vars, run_lmer)) %>%
  mutate(aic = map(mod, AIC)) %>%
  select(-mod) %>%
  unnest(cols = c(aic))

## Warning: Problem with `mutate()` input `mod`.
## i Model failed to converge with max|grad| = 0.00495663 (tol = 0.002, component 1)
## i Input `mod` is `map(time_vars, run_lmer)`.

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00495663 (tol = 0.002, component 1)

Try https://cran.r-project.org/web/packages/AICcmodavg/AICcmodavg.pdf for AICc?

Iris’s lavaan version

Progress Checkin 3

After meeting with Mike, I was able to use lmer4 to fit the linear mix-effect models in the paper. (See results in data/main_analysis.rmd). Since lmer4 has a built-in argument to turn on maximum likelihood, the problem of missing data was resolved. Becuase there were too many outcome variables, map was used to iterate the model fitting process. I’ve now got results that are close to what I need, but here are a few things I have to do:

I’m still using AIC instead of AICc, and it’s now time to switch to AICc.
ORFPM (reading measure) is split by grade (Kindergarten vs. K-12) in the paper, and I need to recalculate AICc for this measure after a group_by.
(will need help) The current models didn’t put condition as an IV into the formula. I think it’ll be an easy thing to do for Model Set 2 in the paper, where both conditions adopt the same assumptions about the curve pattern. I’m unsure how to estimate Model Set 1, where the control condition is always fit by a linear model.
(probably need help) There is one more model that hasn’t been created yet, which assumes a quadratic relationship for both conditions. I’m guessing that it can be achieved by manipulating the time variables again to make it 1^2, 2^2, 3^2 and 4^2? But I need some confirmation here.

Key analysis

The analyses as specified in the analysis plan.

Side-by-side graph with original graph is ideal here

###Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Summary of Reproduction Attempt

Open the discussion section with a paragraph summarizing the primary result from the key analysis and assess whether you successfully reproduced it, partially reproduced it, or failed to reproduce it.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis of the dataset, (b) assessment of the meaning of the successful or unsuccessful reproducibility attempt - e.g., for a failure to reproduce the original findings, are the differences between original and present analyses ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the reproducibility attempt (if you contacted them). None of these need to be long.

Reproducibility Report for an SEL Program Efficacy Study by Low et al. (2019, Developmental Psychology)

Iris Zhong (izhong@stanford.edu)

November 17, 2021

Introduction

Justification for choice of study

Anticipated challenges

Links

Methods

Description of the steps required to reproduce the results

Differences from original study

Project Progress Check 1

Measure of success

Pipeline progress

Timeline

Identifier

Condition

Outcome

DSSA

SDQ

Academic Engagement Time

Disruptive Behavior

M-CBM

R-CBM

Other Variables of Interest

Project Progress Check 2

Results

Data preparation

Progress Checkin 3

Key analysis

Discussion

Summary of Reproduction Attempt

Commentary