Social-emotional learning (SEL) programs have gained increasing attention in recent years, making it vital to evaluate their efficacy in promoting social-emotional and academic skills. Low et al. (2019) utilized a two-year dataset from the Second Step program that recorded behavioral reports and observations on over 9,000 students from 61 schools and 7 districts.
The longitudinal data allowed the researchers to investigate which type of growth curve model best explained the developmental trajectories. The study found that on most measures, students’ performance improved only during the school year, not over the summer. Thus, the “summer slide” phenomenon was evident, and a linear growth model was a poor fit.
The researchers then used these models to examine the efficacy of the Second Step program. Students in the intervention group outperformed those in the control group on emotional symptoms, social-emotional scores, skills for learning, and emotional management. However, the program did not appear to improve students’ academic performance, academic engagement, or disruptive behavior, and most effect sizes were relatively small.
The current project aims to reproduce the main results from this paper: 1) find the best-fitting growth curve model for each measure; and 2) evaluate the program’s efficacy on SEL and academic measures. If possible, I would also like to assess model fit with other indices, though that lies outside my current statistical knowledge.
The linear growth curve model is widely used to estimate students’ performance over time. If it does not accurately describe the pattern, as Low et al. (2019) suggested, further longitudinal research on student program outcomes should pursue other models.
Additionally, the paper used only one method to assess model fit. If more measures are available, the results will be more robust.
First, the original data analysis was conducted in SAS, and switching to R could introduce discrepancies in the results.
Second, I have not used growth curve models before, and multilevel analysis is still somewhat new to me, so I will need help from either the class or the authors.
Third, I have not encountered the method used in the paper to account for missing data, so I need to learn more about it and its implications.
Project repository (on Github): https://github.com/psych251/low2019.git
Original paper (as hosted in your repo): https://github.com/psych251/low2019/blob/cbde9963c81eae601a1597c40d561b6119235435/original_paper/low2019.pdf
First, I will review the datasets and the codebook, and remove irrelevant variables.
Next, I will conduct a round of exploratory analysis, such as finding the number of students in each group and the average change in scores over time.
The next part is to reproduce the main result – finding the best-fitting of eight growth curve models for each measure. I will first specify the parameters of each model, use R packages to estimate the models, and identify the best fit with \(\Delta AICc\), as the paper did. I could potentially use other methods to compare the models’ fit, to ensure the robustness of the result.
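For reference, the corrected criterion and the difference scores used throughout follow the standard definitions (with \(k\) estimated parameters and \(n\) observations):

\[ AICc = AIC + \frac{2k(k+1)}{n-k-1}, \qquad \Delta AICc_i = AICc_i - \min_j AICc_j, \]

so the best-fitting model for a given outcome has \(\Delta AICc = 0\).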
If time allows, I will also reproduce the main effect of the intervention, using the best fit models from (4).
The authors used SAS to run the analyses. Since I will use R, I expect some discrepancies in the results.
The first author mentioned to me that they made nuanced decisions about missing data, imputation, and outliers that might not be explicit in the paper.
I expect that these two differences will make it hard to reproduce the exact values in the results. However, if the results are robust enough, I should be able to reach the same conclusions (i.e., which model best fits each DV), which is the goal of this project.
Identify all relevant variables.
Missing data should be adequately handled.
The models should be constructed correctly with all nested variables specified. (Coefficients should be similar to the paper’s results)
The best fit model found by reproduction should be consistent with the paper’s results.
Below is a summary of the dataset.
Year 1: Fall 2012, Spring 2013
Year 2: Fall 2013, Spring 2014
Variables:
Each variable is nested within the one listed after it.
The condition (Delayed Start vs. Early Start) is manipulated on the school level.
Variable:
36 items, teacher-reported rating scale of students’ social-emotional competencies
Range: 0 - 4
Subscales:
Variables:
25 items, teacher-reported scale of students’ behavior
Range: 0 (not true) - 2 (certainly true)
Subscales:
Variables:
Percent intervals of academic engagement time, from direct observation
Range: 0 - 1
Variable:
Percent intervals of disruptive behavior, from direct observation
Range: 0 - 1
Percentage correct on M-CBM math assessment
no granular data
Variable:
Oral Reading Fluency Words Reading Correct Per Minute
“Because reading instruction focused on connected text generally does not begin until first grade, we expected different growth patterns for R-CBM among students who began the study in kindergarten than those who began the study in Grades 1 or 2.”
Range: (Spring 2013) 0 - 229
Variable:
(Summary statistics of these variables are under exploratory_analysis.Rmd.)
The data cleaning code is in data_wrangling.Rmd under the analysis folder. The cleaned dataset is data_average.csv under the data folder. Due to space limits, I will only describe the wrangling process here (a rough code sketch follows the list):
Use the foreign package to convert the .sav file to a .csv file. These are stored as data.sav (only locally, due to its large size) and data.csv in the data folder. This is a wide dataset with each test item at each time point as a column.
Select the desired variables, and pivot the table longer.
Remove an outlier that likely resulted from a data entry error. Reverse the scores for reverse-coded test items.
Then, compute the aggregate score for each measure. For each model type, create a time variable using the coding method described in the paper.
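A rough, partial sketch of the first steps follows. The wide-format item names implied by starts_with() are hypothetical placeholders; the real column names in the dataset differ.

library(foreign)    # read.spss() for the original SPSS file
library(tidyverse)
library(here)

# Step 1: convert the .sav file and keep a .csv copy
data_raw <- read.spss(here("data", "data.sav"), to.data.frame = TRUE)
write_csv(data_raw, here("data", "data.csv"))

# Steps 2-3: keep relevant columns and pivot to long format
# (assumes hypothetical item names such as "dessa1_t1")
data_long <- data_raw %>%
  select(StudentID, SchoolID, Condition, starts_with("dessa"), starts_with("sdq")) %>%
  pivot_longer(cols = -c(StudentID, SchoolID, Condition),
               names_to = c("item", "time"),
               names_sep = "_t",
               values_to = "score")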
Below is the clean dataset:
data_average <- read_csv(here("data", "data_average.csv"))
head(data_average)
## # A tibble: 6 x 29
## StudentID SchoolID Condition Site District StudentGender StudentRaceS
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 101101125 101 Early Start UW Renton Male (1) Asian
## 2 101101125 101 Early Start UW Renton Male (1) Asian
## 3 101101125 101 Early Start UW Renton Male (1) Asian
## 4 101101125 101 Early Start UW Renton Male (1) Asian
## 5 101101189 101 Early Start UW Renton Female (7) Hispanic
## 6 101101189 101 Early Start UW Renton Female (7) Hispanic
## # ... with 22 more variables: Stud_FRLunchS <dbl>, StudentGrade <chr>,
## # time <dbl>, time_A <dbl>, time_B <dbl>, time_C <dbl>, time_D <dbl>,
## # time_E <dbl>, orfpm <dbl>, mathdcm <dbl>, db <dbl>, aet <dbl>,
## # sdq_emotion <dbl>, sdq_social <dbl>, sdq_conduct <dbl>, sdq_hyper <dbl>,
## # sdq_peer <dbl>, dessa_composite <dbl>, dessa_solve <dbl>,
## # dessa_emotion <dbl>, dessa_empathy <dbl>, dessa_skill <dbl>
The key analysis is to find the best-fitting linear mixed-effects model for each outcome variable out of eight possible options. As shown in Figure 1, Model A assumes the same growth from each time point to the next; B predicts growth only during the school year, with scores staying flat from the end of spring to the start of fall (T2 to T3); C assumes growth during Year 1 (T1 to T2), after which the gain stops; D assumes growth during the school year but a drop in scores over the summer (T2 to T3); finally, E adds a quadratic term to A.
Model Set 1 and Model Set 2 differ in how the control group’s performance is estimated: Model Set 1 assumes the control group’s growth can be predicted linearly, while Model Set 2 applies the same piecewise slope to the control group as to the intervention group.
Figure 1. Model demonstration from the paper.
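To make the piecewise coding concrete, below is an illustration with codings I assume for this write-up; the actual time_A–time_E variables follow the coding scheme described in the paper.

# Illustrative (assumed) recodings of the four assessment waves T1-T4:
time_A_example <- c(0, 1, 2, 3)  # Model A: equal growth between every pair of waves
time_B_example <- c(0, 1, 1, 2)  # Model B: growth during school years, flat over the summer
time_C_example <- c(0, 1, 1, 1)  # Model C: growth during Year 1 only, then no change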
Since the authors did not make clear in the paper which fixed and random effects they used, I tried several specifications with the lme4 package:
Fixed effects of condition, time, and their interaction; a random slope of time for each school; a random intercept for each student; students and schools treated as crossed (I believe).
Fixed effects of condition, time, and their interaction; a random slope of time for each student; students nested within schools.
Fixed effects of condition, time, and their interaction; a random intercept for each student; students nested within schools.
When modeling E2, a quadratic Time term is added as a fixed effect.
The last specification yields results closest to the paper’s and produces the fewest convergence issues, so I use this formula when fitting the models.
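For concreteness, here is a minimal sketch of what the run_lmer() helper used below might look like under that last specification. This is my reconstruction, not the authors’ code, and I fit with ML rather than REML so that \(AICc\) values are comparable across models.

library(lme4)

# Sketch of run_lmer(): fixed effects of condition, the chosen time coding,
# and their interaction; random intercepts for students nested within schools
run_lmer <- function(time_var, test, data) {
  f <- as.formula(paste0(test, " ~ Condition * ", time_var,
                         " + (1 | SchoolID/StudentID)"))
  lmer(f, data = data, REML = FALSE)
}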
To handle missing data, I follow the paper’s approach and rely on maximum likelihood estimation. However, I cannot use auxiliary variables in the process, because the authors did not specify which auxiliary variables were used to estimate each dependent variable.
After specifying the function, I use a for loop to fit each model. In each iteration, I also use a function from AICcmodavg to calculate \(AICc\). Below is the code for estimating Model Set 2.
time_vars_2 <- c("time_B", "time_C", "time_D", "time_E")
tests <- colnames(data_average)[17:29]   # the 13 outcome columns (orfpm handled separately by grade)

# Pre-allocate a data frame to hold one row per fitted model
mod_result <- as.data.frame(matrix(NA, 140, 5)) %>%
  rename(model_num = V1,
         model_set = V2,
         test_question = V3,
         aicc_value = V4,
         coef_list = V5)

n <- 0
for (time_var in time_vars_2) {
  for (test in tests) {
    n <- n + 1
    # Fit the mixed-effects model for this time coding and outcome
    mod <- run_lmer(time_var, test, data_average)
    # Small-sample corrected AIC from AICcmodavg
    aicc <- AICcmodavg::AICc(mod, return.K = FALSE, second.ord = TRUE)
    mod_result$model_num[n] <- time_var
    mod_result$model_set[n] <- 2
    mod_result$test_question[n] <- test
    mod_result$aicc_value[n] <- aicc
    # Store the fixed-effect estimates for later plotting
    mod_result$coef_list[n] <- list(c(unname(fixef(mod))))
  }
}
As a side note, the authors grouped students into those who started in kindergarten versus those who started in elementary school when modeling the reading score, because reading instruction is not introduced until first grade. I therefore split the data into these two groups and estimate their reading models separately.
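A rough sketch of that split (the grade labels here are hypothetical; the actual levels of StudentGrade may be coded differently):

data_orfpm_K <- data_average %>% filter(StudentGrade == "K")             # began in kindergarten
data_orfpm_E <- data_average %>% filter(StudentGrade %in% c("1", "2"))   # began in Grades 1 or 2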
Then, using the following code, I calculate \(\Delta AICc\) for each model by finding the minimum \(AICc\) value among the eight models for each outcome and subtracting it from each model’s \(AICc\).
mod_result <- mod_result %>%
  group_by(test_question) %>%
  mutate(min_value = min(aicc_value)) %>%
  ungroup() %>%
  mutate(diff = aicc_value - min_value)
The model comparison is demonstrated in the following data frame, in which my \(\Delta AICc\) values are listed under the diff column, and the results from the original paper are in paper_diff.
mod_result_compare <- read_csv(here("data","mod_result_compare.csv"))
##
## -- Column specification --------------------------------------------------------
## cols(
## model_letter = col_character(),
## model_set = col_double(),
## test_question = col_character(),
## aicc_value = col_double(),
## min_value = col_double(),
## diff = col_double(),
## model_name = col_character(),
## paper_diff = col_double()
## )
head(mod_result_compare)
## # A tibble: 6 x 8
## model_letter model_set test_question aicc_value min_value diff model_name
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 B 2 mathdcm 80502. 80502. 0 mathdcm_B2
## 2 B 2 db 205032. 205029. 2.42 db_B2
## 3 B 2 aet 221982. 221960. 21.0 aet_B2
## 4 B 2 sdq_emotion 95313. 95310. 2.92 sdq_emotion_B2
## 5 B 2 sdq_social 103794. 103714. 80.7 sdq_social_B2
## 6 B 2 sdq_conduct 86574. 86570. 3.66 sdq_conduct_B2
## # ... with 1 more variable: paper_diff <dbl>
To further compare my replication with the original findings, I create a scatter plot for each dependent variable, where each dot represents a model. The x-axis is the \(\Delta AICc\) value I calculated for that model, and the y-axis is the value reported in the paper.
make_aicc_plot <- function(test) {
  # Keep only the models fitted to this outcome and build short labels (e.g. "B2")
  data <- mod_result_compare %>%
    filter(test_question == test) %>%
    mutate(model_name_short = paste0(model_letter, model_set))
  ggplot(data, aes(x = diff, y = paper_diff, color = model_name_short,
                   label = model_name_short)) +
    geom_point() +
    coord_fixed(ratio = 1) +
    geom_abline(slope = 1) +  # 45-degree line: perfect agreement with the paper
    labs(title = test,
         x = "replication AICc diff",
         y = "paper AICc diff") +
    geom_text(aes(label = model_name_short), hjust = 0, vjust = 0) +
    theme_classic() +
    theme(legend.position = "none")
}
# Note: par(mfrow = ...) only affects base graphics, not ggplot objects,
# so the plots below print one at a time
par(mfrow = c(5, 3))
for (i in unique(mod_result_compare$test_question)) {
  print(make_aicc_plot(i))
}
The 45-degree line shows where points should fall if my replication findings exactly matched the paper’s. If a point lies at the origin, the paper and my replication agree on the best-fitting model for that dependent variable. In sum, I am able to replicate 12 out of the 15 model selection decisions.
Some trends can be spotted in these scatter plots. First, academic engaged time (aet) and disruptive behavior (db) produce the most divergent \(\Delta AICc\)s, especially for model D2; both happen to be observational measures collected in the same settings. For the academic tests, math (mathdcm) and reading (orfpm_E and orfpm_K), models such as C2 and D2 deviate greatly from the 45-degree line. Additionally, some SDQ and DESSA measures, such as SDQ conduct problems and DESSA empathy, show systematically larger \(\Delta AICc\) values than the paper reports.
Next, I plot the growth curve predictions implied by the coefficients estimated in my reproduction and those reported in the paper. Below is the set of plots for the models that both my results and the paper’s agree are the best fits.
plot_growth_df <- read_csv(here("data","plot_growth_df.csv"))
colors <- c("my" = "black", "paper" = "red")
# "actual" = "green"
plot_growth <- function(mod_name) {
  # Predicted growth curves: black = my estimates, red = the paper's;
  # line type distinguishes the early- and delayed-start conditions
  plot <- ggplot(plot_growth_df %>% filter(model_name == mod_name),
                 aes(x = time_point)) +
    geom_line(aes(group = 1, y = my_early, color = "my", linetype = "early"), size = 0.8) +
    geom_line(aes(group = 1, y = my_delayed, color = "my", linetype = "delayed"), size = 0.8) +
    geom_line(aes(group = 1, y = paper_early, color = "paper", linetype = "early"), size = 0.8) +
    geom_line(aes(group = 1, y = paper_delayed, color = "paper", linetype = "delayed"), size = 0.8) +
    theme_bw() +
    labs(x = "Time Point",
         y = "Score",
         title = mod_name) +
    geom_point(aes(y = my_early), color = "black") +
    geom_point(aes(y = my_delayed), color = "black") +
    geom_point(aes(y = paper_early), color = "red") +
    geom_point(aes(y = paper_delayed), color = "red") +
    scale_color_manual(values = colors)
  print(plot)
}
# Models for which my analysis and the paper select the same best fit
correct_model_intercept <- plot_growth_df %>%
  filter(agreement == "both")
for (i in unique(correct_model_intercept$model_name)) {
  plot_growth(i)
}
These plots demonstrate that the replication results are very similar to the original study’s, as the black lines (replication) and red lines (paper) track each other closely.
Figure 2. Growth curve predictions for disruptive behavior and actual mean scores. (code available in main_analysis.Rmd)
However, when different models are chosen as the best model, this alignment disappears. For example, the curves from the original study and my reproduction differ drastically for disruptive behavior (see Figure 2): the paper selects B2, while my analysis selects C2 as the best model. After adding the mean scores for both conditions at every time point to the plot, however, I notice that neither B2 nor C2 predicts disruptive behavior scores well. The mean scores for the intervention group (early start) show a somewhat D2-like pattern, implying that disruptive behavior in the classroom increased for children in the SEL curriculum after the summer but declined again by the end of the second year. The students’ disruptive behavior in the control group, on the other hand, peaked at the end of the first year and declined as they grew older. None of the eight models proposed by the authors can capture both of these patterns at once.
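For reference, the observed condition means added to Figure 2 can be computed along these lines (a sketch; the actual plotting code is in main_analysis.Rmd):

# Observed mean disruptive behavior per condition and time point
db_means <- data_average %>%
  group_by(Condition, time) %>%
  summarize(mean_db = mean(db, na.rm = TRUE), .groups = "drop")
db_means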
Besides these main analyses, I have also run descriptive analyses on the dataset. First, I compute the mean, standard deviation, sample size, and missing-data rate for each test at each time point, separated by condition. A direct comparison with the original paper’s statistics (in the supplemental material) shows that the sample size in my cleaned dataset is slightly smaller, implying that the authors removed a few children for various reasons.
data_average %>%
  pivot_longer(cols = orfpm:dessa_skill,
               names_to = "test_question",
               values_to = "score") %>%
  group_by(test_question, Condition, time) %>%
  summarize(n = sum(!is.na(score)),                 # non-missing observations
            mean = mean(score, na.rm = TRUE),
            sd = sd(score, na.rm = TRUE),
            missing_rate = mean(is.na(score))) %>%  # proportion missing
  kable(caption = "Table 1. Descriptive statistics for each test, separated by condition and time.") %>%
  collapse_rows(columns = 1, valign = "top")
## `summarise()` has grouped output by 'test_question', 'Condition'. You can override using the `.groups` argument.
| test_question | Condition | time | n | mean | sd | missing_rate |
|---|---|---|---|---|---|---|
| aet | Delayed Start | 1 | 3268 | 83.341799 | 20.186057 | 0.2773109 |
| | Delayed Start | 2 | 2999 | 80.087918 | 22.457127 | 0.3367979 |
| | Delayed Start | 3 | 2983 | 80.327348 | 21.266099 | 0.3403361 |
| | Delayed Start | 4 | 2973 | 79.986882 | 22.417906 | 0.3425475 |
| | Early Start | 1 | 3322 | 81.903198 | 20.672618 | 0.2798613 |
| | Early Start | 2 | 3079 | 79.739821 | 22.567112 | 0.3325385 |
| | Early Start | 3 | 3167 | 81.438637 | 20.833767 | 0.3134620 |
| | Early Start | 4 | 3010 | 81.154214 | 21.486995 | 0.3474962 |
| db | Delayed Start | 1 | 3269 | 8.815753 | 14.370629 | 0.2770898 |
| | Delayed Start | 2 | 2998 | 9.602048 | 16.502955 | 0.3370190 |
| | Delayed Start | 3 | 2983 | 9.280352 | 15.517967 | 0.3403361 |
| | Delayed Start | 4 | 2973 | 9.078260 | 15.819513 | 0.3425475 |
| | Early Start | 1 | 3330 | 9.531577 | 15.500510 | 0.2781270 |
| | Early Start | 2 | 3079 | 8.607429 | 14.647855 | 0.3325385 |
| | Early Start | 3 | 3160 | 9.116886 | 15.402172 | 0.3149794 |
| | Early Start | 4 | 3010 | 8.755512 | 15.304671 | 0.3474962 |
| dessa_composite | Delayed Start | 1 | 3064 | 94.243799 | 26.136042 | 0.3224237 |
| | Delayed Start | 2 | 2738 | 101.794010 | 26.069872 | 0.3945157 |
| | Delayed Start | 3 | 2803 | 99.496611 | 25.153311 | 0.3801415 |
| | Delayed Start | 4 | 2880 | 101.717361 | 25.730278 | 0.3631137 |
| | Early Start | 1 | 3321 | 96.097260 | 25.570076 | 0.2800780 |
| | Early Start | 2 | 2864 | 106.480098 | 25.097051 | 0.3791459 |
| | Early Start | 3 | 2940 | 100.432993 | 24.907968 | 0.3626707 |
| | Early Start | 4 | 2820 | 105.217021 | 25.748608 | 0.3886842 |
| dessa_emotion | Delayed Start | 1 | 3085 | 23.474878 | 6.543759 | 0.3177797 |
| | Delayed Start | 2 | 2742 | 25.247629 | 6.662086 | 0.3936311 |
| | Delayed Start | 3 | 2806 | 24.527798 | 6.373836 | 0.3794781 |
| | Delayed Start | 4 | 2882 | 25.268217 | 6.618730 | 0.3626714 |
| | Early Start | 1 | 3324 | 24.115824 | 6.347790 | 0.2794277 |
| | Early Start | 2 | 2871 | 26.604319 | 6.342559 | 0.3776284 |
| | Early Start | 3 | 2960 | 25.012500 | 6.172154 | 0.3583351 |
| | Early Start | 4 | 2834 | 26.420607 | 6.489140 | 0.3856493 |
| dessa_empathy | Delayed Start | 1 | 3141 | 23.154091 | 7.302853 | 0.3053958 |
| | Delayed Start | 2 | 2758 | 25.311095 | 6.992277 | 0.3900929 |
| | Delayed Start | 3 | 2819 | 24.826179 | 6.779997 | 0.3766033 |
| | Delayed Start | 4 | 2896 | 25.292818 | 6.940649 | 0.3595754 |
| | Early Start | 1 | 3342 | 23.600239 | 6.995229 | 0.2755257 |
| | Early Start | 2 | 2893 | 26.537850 | 6.761986 | 0.3728593 |
| | Early Start | 3 | 2999 | 24.853951 | 6.796234 | 0.3498808 |
| | Early Start | 4 | 2878 | 26.177206 | 6.847655 | 0.3761110 |
| dessa_skill | Delayed Start | 1 | 3143 | 24.488705 | 7.331544 | 0.3049536 |
| | Delayed Start | 2 | 2757 | 26.103373 | 7.313614 | 0.3903140 |
| | Delayed Start | 3 | 2823 | 25.413390 | 7.194734 | 0.3757187 |
| | Delayed Start | 4 | 2898 | 25.946860 | 7.211290 | 0.3591331 |
| | Early Start | 1 | 3360 | 24.579167 | 7.159912 | 0.2716237 |
| | Early Start | 2 | 2914 | 26.947495 | 6.886231 | 0.3683070 |
| | Early Start | 3 | 3008 | 25.500332 | 7.187775 | 0.3479298 |
| | Early Start | 4 | 2878 | 26.472203 | 7.240716 | 0.3761110 |
| dessa_solve | Delayed Start | 1 | 3134 | 23.317167 | 6.927621 | 0.3069438 |
| | Delayed Start | 2 | 2759 | 25.179413 | 6.910864 | 0.3898717 |
| | Delayed Start | 3 | 2822 | 24.710135 | 6.661430 | 0.3759398 |
| | Delayed Start | 4 | 2896 | 25.169544 | 6.816665 | 0.3595754 |
| | Early Start | 1 | 3342 | 23.766906 | 6.723500 | 0.2755257 |
| | Early Start | 2 | 2882 | 26.297710 | 6.607101 | 0.3752439 |
| | Early Start | 3 | 2988 | 25.026774 | 6.585226 | 0.3522653 |
| | Early Start | 4 | 2863 | 26.028292 | 6.868453 | 0.3793627 |
| mathdcm | Delayed Start | 1 | 3275 | 1.554898 | 1.437160 | 0.2757629 |
| | Delayed Start | 2 | 3077 | 3.173817 | 1.617941 | 0.3195489 |
| | Delayed Start | 3 | 3045 | 3.335304 | 1.465916 | 0.3266254 |
| | Delayed Start | 4 | 2921 | 4.658465 | 1.452153 | 0.3540469 |
| | Early Start | 1 | 3361 | 1.486989 | 1.454646 | 0.2714069 |
| | Early Start | 2 | 3145 | 3.123084 | 1.694985 | 0.3182311 |
| | Early Start | 3 | 3161 | 3.299510 | 1.466015 | 0.3147626 |
| | Early Start | 4 | 2983 | 4.631830 | 1.394130 | 0.3533492 |
| orfpm | Delayed Start | 1 | 3265 | 24.303216 | 33.398587 | 0.2779743 |
| | Delayed Start | 2 | 3029 | 48.397491 | 42.280435 | 0.3301636 |
| | Delayed Start | 3 | 3035 | 60.635255 | 44.578511 | 0.3288368 |
| | Delayed Start | 4 | 2934 | 83.632243 | 41.825960 | 0.3511720 |
| | Early Start | 1 | 3326 | 22.722790 | 33.739874 | 0.2789941 |
| | Early Start | 2 | 3110 | 48.604502 | 45.104566 | 0.3258183 |
| | Early Start | 3 | 3145 | 57.881399 | 45.667438 | 0.3182311 |
| | Early Start | 4 | 2971 | 82.048132 | 42.549136 | 0.3559506 |
| sdq_conduct | Delayed Start | 1 | 3148 | 1.050508 | 1.739188 | 0.3038479 |
| | Delayed Start | 2 | 2764 | 1.096961 | 1.806612 | 0.3887660 |
| | Delayed Start | 3 | 2845 | 1.018278 | 1.720139 | 0.3708536 |
| | Delayed Start | 4 | 2896 | 1.245856 | 1.966116 | 0.3595754 |
| | Early Start | 1 | 3361 | 1.066052 | 1.675039 | 0.2714069 |
| | Early Start | 2 | 2896 | 1.068370 | 1.723601 | 0.3722090 |
| | Early Start | 3 | 3012 | 1.014940 | 1.653105 | 0.3470626 |
| | Early Start | 4 | 2883 | 1.116545 | 1.818904 | 0.3750271 |
| sdq_emotion | Delayed Start | 1 | 3148 | 1.152160 | 1.875377 | 0.3038479 |
| | Delayed Start | 2 | 2765 | 1.321881 | 1.925487 | 0.3885449 |
| | Delayed Start | 3 | 2849 | 1.374868 | 1.968827 | 0.3699690 |
| | Delayed Start | 4 | 2901 | 1.574285 | 2.093277 | 0.3584697 |
| | Early Start | 1 | 3352 | 1.249403 | 1.916295 | 0.2733579 |
| | Early Start | 2 | 2906 | 1.213696 | 1.853897 | 0.3700412 |
| | Early Start | 3 | 3017 | 1.293669 | 1.904884 | 0.3459788 |
| | Early Start | 4 | 2884 | 1.362344 | 1.981562 | 0.3748103 |
| sdq_hyper | Delayed Start | 1 | 3148 | 3.358958 | 3.121030 | 0.3038479 |
| | Delayed Start | 2 | 2764 | 3.241317 | 3.125904 | 0.3887660 |
| | Delayed Start | 3 | 2849 | 3.457353 | 3.133111 | 0.3699690 |
| | Delayed Start | 4 | 2900 | 3.428965 | 3.168291 | 0.3586908 |
| | Early Start | 1 | 3370 | 3.529080 | 3.096913 | 0.2694559 |
| | Early Start | 2 | 2914 | 3.109128 | 3.002305 | 0.3683070 |
| | Early Start | 3 | 3017 | 3.458071 | 3.124778 | 0.3459788 |
| | Early Start | 4 | 2885 | 3.302253 | 3.133885 | 0.3745935 |
| sdq_peer | Delayed Start | 1 | 3143 | 1.413618 | 1.675477 | 0.3049536 |
| | Delayed Start | 2 | 2762 | 1.247647 | 1.643598 | 0.3892083 |
| | Delayed Start | 3 | 2843 | 1.292649 | 1.653959 | 0.3712959 |
| | Delayed Start | 4 | 2899 | 1.322180 | 1.710912 | 0.3589120 |
| | Early Start | 1 | 3340 | 1.457784 | 1.657843 | 0.2759592 |
| | Early Start | 2 | 2885 | 1.177816 | 1.569891 | 0.3745935 |
| | Early Start | 3 | 3007 | 1.389425 | 1.611975 | 0.3481465 |
| | Early Start | 4 | 2883 | 1.312522 | 1.637034 | 0.3750271 |
| sdq_social | Delayed Start | 1 | 3121 | 7.012496 | 2.508735 | 0.3098187 |
| | Delayed Start | 2 | 2762 | 7.657495 | 2.369979 | 0.3892083 |
| | Delayed Start | 3 | 2814 | 7.582445 | 2.358274 | 0.3777090 |
| | Delayed Start | 4 | 2893 | 7.690633 | 2.393955 | 0.3602388 |
| | Early Start | 1 | 3330 | 7.144444 | 2.483682 | 0.2781270 |
| | Early Start | 2 | 2880 | 7.739236 | 2.336057 | 0.3756774 |
| | Early Start | 3 | 3003 | 7.432234 | 2.365993 | 0.3490137 |
| | Early Start | 4 | 2865 | 7.669110 | 2.356223 | 0.3789291 |
I also replicated the correlation plot from the paper. (Note: the reading score orfpm is excluded from these plots because the paper filtered the data by student grade, which is hard to do here.)
Figure 3. Correlation plot at Time 1.(code available in exploratory_analysis.Rmd )
Figure 4. Correlation plot at Time 4. (code available in exploratory_analysis.Rmd )
Table 2. Correlation plot from the paper. Upper right: Time 1; bottom left: Time 4.
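The correlation matrices behind Figures 3 and 4 can be computed roughly as follows (a sketch; the actual code is in exploratory_analysis.Rmd, and pairwise deletion is my assumption):

# Time 1 correlations (Time 4 is analogous with time == 4); orfpm excluded as noted above
cor_t1 <- data_average %>%
  filter(time == 1) %>%
  select(mathdcm:dessa_skill) %>%
  cor(use = "pairwise.complete.obs")
round(cor_t1, 2)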
The correlation coefficients are not exactly the same, but they fall in similar ranges. DESSA measures have strong intercorrelations, while the correlations within the SDQ are somewhat weaker. The classroom behavior measures (db, aet) and the math score (mathdcm) are only weakly correlated with the other variables most of the time.
In sum, I am able to replicate 12 out of the 15 model selection decisions, so I partially reproduce the results of the paper. However, since my results and the paper’s disagree on the best-fitting model for three outcome variables, and some \(\Delta AICc\) values differ drastically even when we select the same best-fitting model, I suspect there are systematic errors somewhere in my analysis that I have not been able to spot. Possible sources of the discrepancies include: 1) the fixed and random effects are not specified the way the paper specifies them; 2) omitting the auxiliary variables from the maximum likelihood estimation could cause differences; 3) the outlier-removal decisions are not documented, so we might have been using slightly different datasets.
Here are some of the things I noticed and contemplated while working on this project:
Interestingly, the authors in fact collected data at five time points: they measured everything again in Year 3. I am curious how the results would change if the models were estimated with five waves, and how many more candidate models could be proposed.
The above discussion connects to a related concern: with so many assumptions built into the models, do they suffer from overfitting? If the dataset were extended to 10 years, would the score fluctuations these models capture still matter?
One question that bugs me is the practical implication of this research. At this point, I can say with considerable confidence that a simple linear growth model assuming a constant slope over time does not work well. However, I find it hard to justify why the developmental trajectory of a given skill should be estimated by, for instance, D2 rather than C2, and I doubt whether these results would replicate with a different sample.
Due to time constraints, I was not able to reproduce the main effects of the SEL curriculum. But looking at the paper’s conclusions, with an effect size of 0.10, I am not sure the SEL intervention is truly effective. I wonder whether it could have longer-term benefits that cannot be captured over the course of two years.
I am grateful that I have received a lot of help from the authors. I especially appreciate the long email from Keith that answered my questions with so many details.
And finally, thank you, the PSYCH 251 teaching team!! I’m probably one of the few people, if not the only one, without a lab or PI to rely on, and I could not have completed this project without all of your help. I want to thank Julie for the exhaustive feedback, Sarah for helping me with debugging, Effie for answering my questions during OH, and Mike for those extended office hours to help me with everything!!
This project is written in R, with the help of the following packages: tidyverse, knitr, lme4, here, kableExtra, readr, foreign, AICcmodavg.