Data Analysis for SSSR conference (9th-13th July 2024)
Author
B. Brossette, M. Vernet, A. El Ahmadi, & S. Ducrot
Initialization
# Clean the environment
rm(list = ls())

# Libraries
library(readr)
library(stringr)
library(plyr)
library(dplyr)
library(tidyverse)
library(interactions)
library(corrplot)
library(Hmisc)
library(kableExtra)
library(car)
library(lavaan)
library(semPlot)

# Data upload
df <- read_csv("../src/paper-draft/2024-06-26_dataset_reduc.csv", col_types = cols(...1 = col_skip()))

# Code sex as numeric (M = 1, F = 2)
df$sexN <- as.numeric(factor(df$sex, levels = c("M", "F")))
1. Material and Methods
1.1. Participants
One hundred and ninety-eight children from public elementary schools were tested at the end of their school year in May: 34 were 1st Graders, 37 were 2nd Graders, 30 were 3rd Graders, 27 were 4th Graders and 59 were 5th Graders. All participants were native French speakers with normal or corrected-to-normal vision. Informed consent was provided by the participants' caregivers prior to experimentation. Ethical approval for this study was granted by the Ethics Committee of Aix-Marseille University. The experiment was performed in accordance with relevant guidelines and regulations and in accordance with the Declaration of Helsinki.
1.2. Assessments
1.2.1. Leisure reading time
Participants' caregivers completed a large paper-and-pencil survey on their home literacy habits, in which they reported the time their child spent reading each week during leisure time.
1.2.2. Visuo-attentional abilities
Visual and oculomotor skills were assessed using the DEM test (Developmental Eye Movement Test; Garzia et al., 1990; Richman, 2009), which is composed of horizontal and vertical digit reading tasks printed on four different sheets of paper: the pre-test (a horizontal line of ten 0.5 cm high digits), which ensures that the child is familiar with all the digits presented; two vertical tests (Test A and B; each composed of two vertical lines of twenty 0.5 cm high digits separated by a 10.5 cm horizontal margin, with a 0.5 cm vertical distance between digits), which are intended to evaluate the automaticity of number naming while involving only basic oculomotor skills; and one horizontal test (Test C; sixteen lines of five irregularly spaced 0.5 cm high digits), which requires good oculomotor and visual-attentional skills in addition to number naming skills, because the variable spacing and the numerous line changes make ocular tracking of the lines more difficult. Children were asked to read the digits aloud as fast and as accurately as possible. The task was timed, and errors and omissions were recorded.
The DEM test provides four main indices: (1) the vertical reading time (VT; in seconds), which is the total time spent naming the eighty vertically organized digits of Test A and B (in accordance with the test manual, errors were not used for scoring); (2) the adjusted horizontal time (HT; in seconds), which is the time required to read the eighty horizontally organized digits presented in Test C, corrected for omission and addition errors; (3) the total number of errors (HE), which reflects accuracy on Test C and is the sum of the four possible types of errors: addition errors (i.e., adding or repeating a digit), omission errors (i.e., forgetting a digit), transposition errors (i.e., inverting two digits) and substitution errors (i.e., replacing one digit with another); and (4) the Ratio score (R), which is calculated by dividing HT by VT. This ratio is intended to isolate the involvement of oculomotor processes in the horizontal naming task.
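The two derived indices described above (HE and R) can be sketched in a few lines of R. This is only an illustration: the function name `dem_indices`, its argument names, and the values for the example child are hypothetical, and the correction applied to obtain the adjusted HT is not reproduced because the text does not give it.

```r
# Hypothetical helper: derives HE and R from already-scored DEM components.
# vt and ht are times in seconds (ht is assumed to be already adjusted).
dem_indices <- function(vt, ht, additions, omissions, transpositions, substitutions) {
  list(
    HE = additions + omissions + transpositions + substitutions,  # total errors on Test C
    R  = ht / vt                                                  # Ratio score
  )
}

# Made-up values for one child, for illustration only
child <- dem_indices(vt = 40, ht = 60,
                     additions = 1, omissions = 2,
                     transpositions = 0, substitutions = 1)
print(child)
```

A ratio above 1 means the horizontal task took longer than the vertical one, which is the pattern the R score is meant to capture.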
1.2.3. Reading Fluency
Reading scores (CTL indicator) were obtained using the Alouette test (Lefavrais, 1965), a standardized French reading level assessment. Participants must read 265 words as rapidly and accurately as possible within a 3-minute time limit. The text includes rare words and some spelling traps, and the characteristics of this test prevent readers from compensating for their written word recognition difficulties via contextual guessing. We used the CTL score, which considers both correctness and speed (CTL = (C × 180)/TL, with C = the number of words correctly read and TL = the reading time in seconds; max: 180 sec.).
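As a worked illustration of the CTL formula given above, the following sketch computes the score for two made-up readers. The function name `ctl_score` and the example values are assumptions introduced here, not part of the original scoring script.

```r
# Hypothetical helper implementing CTL = (C x 180) / TL
# correct_words: number of words read correctly (C)
# reading_time_sec: reading time in seconds (TL), capped at 180 by the test
ctl_score <- function(correct_words, reading_time_sec) {
  stopifnot(reading_time_sec > 0, reading_time_sec <= 180)
  (correct_words * 180) / reading_time_sec
}

# Made-up readers, for illustration only
ctl_full <- ctl_score(265, 180)  # whole text read correctly in exactly 3 minutes
ctl_fast <- ctl_score(260, 120)  # 260 correct words in 2 minutes
print(c(ctl_full, ctl_fast))
```

Because the score rescales performance to a 180-second window, a reader who finishes early can obtain a CTL above the 265 words of the text.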
# Selecting relevant variables
mat.df <- df[, -c(1, 2, 3, 5, 6, 7)]
cor.mat <- rcorr(as.matrix(mat.df))

# Compute matrices
mat <- cor.mat$r
pmat <- cor.mat$P

# Replace the NA of the diagonal
pmat[is.na(pmat)] <- 0

# Display the matrix
corrplot(mat, method = "color", addCoef.col = "black", number.cex = 0.7,
         tl.cex = 0.8, tl.col = "black", cl.pos = "b", cl.cex = 0.8, type = "lower",
         title = "Correlation Matrix with Significant Correlations Marked", mar = c(0, 0, 1, 0),
         p.mat = pmat, sig.level = 0.05, insig = "pch", pch.col = "black", pch.cex = 1.5)
2.2. Assessing multicollinearity
The Variance Inflation Factor (VIF) is a measure used to detect the presence and severity of multicollinearity in regression analysis. Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, which can cause issues with the interpretation and reliability of the model's coefficients. The VIF index can be interpreted as follows:
VIF = 1: No correlation between the predictor variable and the other predictor variables.
1 < VIF < 5: Moderate correlation, but generally considered acceptable.
VIF > 5: High correlation, indicating that multicollinearity might be problematic. Some practitioners use a threshold of 10 to indicate serious multicollinearity.
VIF > 10: Strong multicollinearity, which could severely affect the estimation of regression coefficients.
No multicollinearity issues were detected: the VIF values of the four predictors ranged from 1.03 to 2.29, well below the threshold of 5.
# Fit a linear regression model
model.1 <- lm(AL_CTL ~ DEM_HT + DEM_VT + LPLA_P_time + age, data = df)

# Calculate VIF values
vif_values <- vif(model.1)

# Print VIF values
print(vif_values)
DEM_HT DEM_VT LPLA_P_time age
2.285301 1.933917 1.029018 1.799007
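To make the VIF values above easier to interpret, the sketch below computes the index by hand from its usual definition, VIF_j = 1 / (1 - R²_j), where R²_j is obtained by regressing predictor j on the remaining predictors. The data are simulated for illustration only (the variables `x1`, `x2`, `x3` are not from the study), and the result is cross-checked against the diagonal of the inverse correlation matrix, which gives the same quantity.

```r
# Simulated predictors, for illustration only (not the study data)
set.seed(42)
n  <- 198
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- 0.3 * x1 + 0.3 * x2 + rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R2_j), with R2_j from regressing predictor j on the others
vif_manual <- sapply(names(predictors), function(p) {
  others <- setdiff(names(predictors), p)
  r2 <- summary(lm(reformulate(others, response = p), data = predictors))$r.squared
  1 / (1 - r2)
})

# Equivalent computation: diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(predictors)))

print(round(vif_manual, 3))
```

Read in this way, the largest value reported above (2.285 for DEM_HT) corresponds to an R² of about 0.56 when DEM_HT is regressed on the other three predictors.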
2.3. Our conceptual model
The indirect effect of visual-attentional abilities (through HT and VT indices) was tested using Structural Equation Modelling (SEM) with the lavaan package in the RStudio environment. SEM concurrently models all paths, giving a more powerful, accurate, and robust estimation of mediation effects than more traditional tests based on sequential regressions, primarily when more than one mediator is included in the model. All of the relationships among variables in the model are tested together, and all paths can be compared with each other in terms of each variable's degree of importance. Standard errors were estimated by bootstrapping with 5,000 draws. As no golden rule exists to assess model fit, reporting a variety of indices is recommended to reflect different aspects of model fit.
First, we tested our conceptual model, which contains a mediation effect from leisure reading time to reading fluency through both measures of visual-attentional abilities (HT and VT indices). Moreover, as "age" was significantly correlated with reading fluency and with the HT and VT indices, we controlled these measures for the effect of age. Finally, as the HT and VT indices share (theoretically and empirically) common variance, we allowed their residuals to covary (partial correlation between these two variables).
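The resampling logic behind a bootstrapped indirect effect can be sketched in base R. This is a deliberately simplified illustration under stated assumptions: the data are simulated, there is a single mediator, the two paths are estimated with separate `lm()` fits rather than with a simultaneous SEM, and only 500 draws are used to keep it fast. It is not the analysis reported in this document.

```r
# Simulated data, for illustration only: x -> m -> y with a direct x -> y path
set.seed(1)
n <- 200
x <- rnorm(n)
m <- 0.3 * x + rnorm(n)
y <- 0.5 * m + 0.2 * x + rnorm(n)
sim <- data.frame(x, m, y)

# Indirect effect a*b from two regressions (simplification of the SEM approach)
indirect_effect <- function(d) {
  a <- coef(lm(m ~ x, data = d))[["x"]]      # path x -> m
  b <- coef(lm(y ~ m + x, data = d))[["m"]]  # path m -> y, controlling for x
  a * b
}

# Nonparametric bootstrap: resample rows with replacement and re-estimate a*b
n_boot  <- 500
boot_ab <- replicate(n_boot, {
  idx <- sample(nrow(sim), replace = TRUE)
  indirect_effect(sim[idx, ])
})

boot_se <- sd(boot_ab)                        # bootstrap standard error
boot_ci <- quantile(boot_ab, c(0.025, 0.975)) # percentile confidence interval
print(boot_se)
print(boot_ci)
```

The bootstrap standard error is simply the standard deviation of the re-estimated products, which avoids assuming that a product of two coefficients is normally distributed.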
The conceptual model was designed to explore the relationship between leisure reading time (LPLA_P_time), visual and attentional skills (HT and VT indices), and reading fluency (AL_CTL). Our initial model hypothesized direct and mediated effects of leisure reading time on reading fluency through visual and attentional skills.
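As expected, we found the following direct effects:
Leisure reading time (LPLA_P_time) was positively related to reading fluency (AL_CTL; b1 = 0.188, p < .001) and negatively related to DEM_HT (a1 = -0.021, p = .025): children who read more in their leisure time needed less time on the horizontal test. However, leisure reading time was not significantly related to DEM_VT (a2 = -0.004, p = .353).
Age significantly influenced DEM_HT, DEM_VT, and AL_CTL.
However, both mediation effects (through HT and VT) were non-significant (indirect effect through DEM_HT = 0.013, p = .063; through DEM_VT = 0.008, p = .418).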
Although our conceptual model exhibited perfect fit indices (CFI = 1.000, RMSEA = 0.000, SRMR = 0.000), it has zero degrees of freedom: the model is saturated (just-identified), so it reproduces the observed covariances exactly and its fit indices are uninformative. We treat this as an overfitting issue. Overfitting must be avoided because it leads to several problems, such as poor generalization to new data, increased model complexity, misleading inferences, wasted resources, and reduced model robustness, ultimately compromising the reliability and validity of the findings.
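The zero degrees of freedom can be checked by simple counting. The sketch below assumes that the two exogenous covariates (LPLA_P_time and age) have their variances and covariance fixed to the sample values, which we understand to be the lavaan default for observed exogenous variables; the helper name `sem_df` is hypothetical.

```r
# Hypothetical helper: degrees of freedom of a path model with observed variables only
# n_observed:  number of observed variables in the model
# n_exogenous: number of exogenous covariates whose (co)variances are fixed (assumption)
# n_free:      number of freely estimated parameters
sem_df <- function(n_observed, n_exogenous, n_free) {
  moments       <- n_observed * (n_observed + 1) / 2    # unique variances and covariances
  fixed_moments <- n_exogenous * (n_exogenous + 1) / 2  # moments not modelled
  moments - fixed_moments - n_free
}

# Conceptual model: 5 observed variables, 2 exogenous, 12 free parameters
df_conceptual <- sem_df(n_observed = 5, n_exogenous = 2, n_free = 12)
print(df_conceptual)  # 0
```

With the same counting, a model that estimates 11 parameters instead of 12 has one degree of freedom, which is the smallest change that makes the fit indices informative.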
# Define the model
model.2 <- '
  # Direct effect of LPLA on CTL
  AL_CTL ~ b1*LPLA_P_time

  # Mediation of LPLA on CTL through DEM_HT
  DEM_HT ~ a1*LPLA_P_time
  AL_CTL ~ b2*DEM_HT

  # Mediation of LPLA on CTL through DEM_VT
  DEM_VT ~ a2*LPLA_P_time
  AL_CTL ~ b3*DEM_VT

  # Partial correlation between DEM_HT and DEM_VT
  DEM_HT ~~ c3*DEM_VT

  # Direct effect of age on CTL, DEM_HT, and DEM_VT
  AL_CTL ~ d1*age
  DEM_HT ~ d2*age
  DEM_VT ~ d3*age

  # Indirect effects
  indirect_DEM_HT := a1*b2
  indirect_DEM_VT := a2*b3

  # Total effect
  total := b1 + (a1*b2) + (a2*b3)
'

# Fit the model to the data
fit.2 <- sem(model.2, data = df, se = "bootstrap", bootstrap = 5000)

# Summary of the results
summary(fit.2, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)
lavaan 0.6-18 ended normally after 24 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 12
Number of observations 198
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Model Test Baseline Model:
Test statistic 443.508
Degrees of freedom 9
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -2808.541
Loglikelihood unrestricted model (H1) -2808.541
Akaike (AIC) 5641.082
Bayesian (BIC) 5680.541
Sample-size adjusted Bayesian (SABIC) 5642.525
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 NA
P-value H_0: RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 5000
Number of successful bootstrap draws 5000
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
AL_CTL ~
LPLA_P_tm (b1) 0.188 0.035 5.427 0.000 0.188 0.278
DEM_HT ~
LPLA_P_tm (a1) -0.021 0.010 -2.243 0.025 -0.021 -0.091
AL_CTL ~
DEM_HT (b2) -0.591 0.185 -3.199 0.001 -0.591 -0.205
DEM_VT ~
LPLA_P_tm (a2) -0.004 0.004 -0.928 0.353 -0.004 -0.046
AL_CTL ~
DEM_VT (b3) -1.902 0.649 -2.932 0.003 -1.902 -0.242
age (d1) 2.340 0.421 5.559 0.000 2.340 0.374
DEM_HT ~
age (d2) -1.372 0.155 -8.859 0.000 -1.372 -0.633
DEM_VT ~
age (d3) -0.442 0.062 -7.162 0.000 -0.442 -0.556
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.DEM_HT ~~
.DEM_VT (c3) 172.478 54.459 3.167 0.002 172.478 0.492
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.AL_CTL 4516.097 468.532 9.639 0.000 4516.097 0.356
.DEM_HT 877.995 190.819 4.601 0.000 877.995 0.578
.DEM_VT 139.775 39.338 3.553 0.000 139.775 0.683
R-Square:
Estimate
AL_CTL 0.644
DEM_HT 0.422
DEM_VT 0.317
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indirct_DEM_HT 0.013 0.007 1.861 0.063 0.013 0.019
indirct_DEM_VT 0.008 0.009 0.809 0.418 0.008 0.011
total 0.208 0.038 5.533 0.000 0.208 0.308
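The defined parameters above are products and sums of the path coefficients, so they can be recovered approximately from the rounded estimates printed in the Regressions table. The sketch below does this arithmetic; small discrepancies with the reported values are expected because the inputs are rounded to three decimals.

```r
# Rounded path estimates copied from the output above
b1 <- 0.188   # LPLA_P_time -> AL_CTL
a1 <- -0.021  # LPLA_P_time -> DEM_HT
b2 <- -0.591  # DEM_HT      -> AL_CTL
a2 <- -0.004  # LPLA_P_time -> DEM_VT
b3 <- -1.902  # DEM_VT      -> AL_CTL

indirect_DEM_HT <- a1 * b2                            # reported as 0.013
indirect_DEM_VT <- a2 * b3                            # reported as 0.008
total <- b1 + indirect_DEM_HT + indirect_DEM_VT       # reported as 0.208

print(round(c(indirect_DEM_HT = indirect_DEM_HT,
              indirect_DEM_VT = indirect_DEM_VT,
              total = total), 3))
```

Both indirect effects are positive because each is the product of two negative paths: more leisure reading goes with shorter naming times, and shorter naming times go with higher fluency.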
2.4. Our revised model
As our primary theoretical interest is the potential mediation effect between leisure reading time and reading fluency through the HT index, we decided to remove the path from leisure reading time to DEM_VT, which was non-significant in the previous analysis. Removing this path frees one degree of freedom, so the model is no longer saturated, which addresses the over-fitting issue.
We found the following direct effects:
Leisure reading time (LPLA_P_time) was positively related to reading fluency (AL_CTL; b1 = 0.188, p < .001) and negatively related to DEM_HT (a1 = -0.016, p = .030).
Age significantly influenced DEM_HT, DEM_VT, and AL_CTL.
However, the mediation effect between leisure reading time and reading fluency through DEM_HT was still non-significant (indirect effect = 0.010, p = .083).
The revised model exhibited excellent fit indices (χ²(1) = 0.609, p = .435; CFI = 1.000; SRMR = 0.014), indicating a good fit to the data while controlling for over-fitting issues. However, the upper bound of the 90% RMSEA confidence interval can indicate potential misspecification if it exceeds acceptable thresholds (typically 0.08), and this is the case here (upper bound = 0.172). An RMSEA of 0.000 with an upper confidence limit higher than 0.05 can signal that the model might not capture all relevant aspects of the data. Moreover, although the model explains a substantial portion of the variance in reading fluency (R² = 0.640), there remains unexplained variance that might be accounted for by other factors not included in the model.
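The RMSEA point estimate of 0.000 follows directly from the chi-square being smaller than its degrees of freedom. The sketch below uses one common form of the formula, sqrt(max((χ² - df) / (df × N), 0)); this is an assumption for illustration, since some implementations divide by N - 1 instead of N, and the helper name `rmsea_point` is hypothetical.

```r
# Hypothetical helper: RMSEA point estimate (one common form of the formula)
rmsea_point <- function(chisq, df, n) {
  if (df == 0) return(0)  # saturated model: reported as 0 by convention
  sqrt(max((chisq - df) / (df * n), 0))
}

# Revised model: chi-square = 0.609, df = 1, N = 198
rmsea_revised <- rmsea_point(chisq = 0.609, df = 1, n = 198)
print(rmsea_revised)  # 0, because the chi-square is below its degrees of freedom
```

The point estimate is truncated at zero whenever χ² is below df, so it says little about how precisely the misfit is estimated; that information is carried by the confidence interval.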
# Define the model
model.3 <- '
  # Direct effect of LPLA on CTL
  AL_CTL ~ b1*LPLA_P_time

  # Mediation of LPLA on CTL through DEM_HT
  DEM_HT ~ a1*LPLA_P_time
  AL_CTL ~ b2*DEM_HT

  # Direct effect of DEM_VT on CTL
  AL_CTL ~ b3*DEM_VT

  # Partial correlation between DEM_HT and DEM_VT
  DEM_HT ~~ c3*DEM_VT

  # Direct effect of age on CTL, DEM_HT, and DEM_VT
  AL_CTL ~ d1*age
  DEM_HT ~ d2*age
  DEM_VT ~ d3*age

  # Indirect effect
  indirect_DEM_HT := a1*b2

  # Total effect
  total := b1 + (a1*b2)
'

# Fit the model to the data
fit.3 <- sem(model.3, data = df, se = "bootstrap", bootstrap = 5000)

# Summary of the results
summary(fit.3, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)
lavaan 0.6-18 ended normally after 22 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 11
Number of observations 198
Model Test User Model:
Test statistic 0.609
Degrees of freedom 1
P-value (Chi-square) 0.435
Model Test Baseline Model:
Test statistic 443.508
Degrees of freedom 9
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.008
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -2808.845
Loglikelihood unrestricted model (H1) -2808.541
Akaike (AIC) 5639.690
Bayesian (BIC) 5675.861
Sample-size adjusted Bayesian (SABIC) 5641.013
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.172
P-value H_0: RMSEA <= 0.050 0.538
P-value H_0: RMSEA >= 0.080 0.337
Standardized Root Mean Square Residual:
SRMR 0.014
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 5000
Number of successful bootstrap draws 5000
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
AL_CTL ~
LPLA_P_tm (b1) 0.188 0.035 5.397 0.000 0.188 0.279
DEM_HT ~
LPLA_P_tm (a1) -0.016 0.008 -2.174 0.030 -0.016 -0.070
AL_CTL ~
DEM_HT (b2) -0.591 0.190 -3.109 0.002 -0.591 -0.205
DEM_VT (b3) -1.902 0.642 -2.961 0.003 -1.902 -0.243
age (d1) 2.340 0.427 5.483 0.000 2.340 0.376
DEM_HT ~
age (d2) -1.377 0.160 -8.584 0.000 -1.377 -0.636
DEM_VT ~
age (d3) -0.447 0.063 -7.080 0.000 -0.447 -0.562
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.DEM_HT ~~
.DEM_VT (c3) 173.009 55.308 3.128 0.002 173.009 0.493
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.AL_CTL 4516.097 473.462 9.538 0.000 4516.097 0.360
.DEM_HT 878.651 196.357 4.475 0.000 878.651 0.580
.DEM_VT 140.205 39.903 3.514 0.000 140.205 0.685
R-Square:
Estimate
AL_CTL 0.640
DEM_HT 0.420
DEM_VT 0.315
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indirct_DEM_HT 0.010 0.006 1.732 0.083 0.010 0.014
total 0.198 0.036 5.446 0.000 0.198 0.293
2.5. Comparison of the conceptual and the revised model