Data Analysis for SSSR conference (9th-13th July 2024)

Author

B. Brossette, M. Vernet, A. El Ahmadi, & S. Ducrot

Initializarion

#clean the environment
rm(list = ls())

#Libraries
library(readr)
library(stringr)
library(plyr)
library(dplyr)
library(tidyverse)
library(interactions)
library(corrplot)
library(Hmisc)
library(kableExtra)
library(car)
library(lavaan)
library(semPlot)

#data upload
df <- read_csv("../src/paper-draft/2024-06-26_dataset_reduc.csv", 
    col_types = cols(...1 = col_skip()))


#code sex as numeric (M=1, F=2)
df$sexN <- as.numeric(factor(df$sex, levels = c("M", "F")))

1. Material and Methods

1.1. Participants

One hundred and ninety-eight children from public elementary schools were tested at theend of their school year in May: 34 were 1st Graders, 37 were 2nd Graders, 30 were 3rd Gradeers, 27 were 4th Graders and 59 were 5thGraders. All participants were native French speakers with normal or corrected-to-normal vision. Informed consent was provided by the participant’s caregivers prior to experimentation.Ethical approval for this study was granted by the Ethics Committee of Aix-Marseille University. The experiment was performed in accordance with relevant guidelinesand regulations and in accordance with the Declaration of Helsink.

1.2. Assessments

1.2.1. Leisure reading time

Participant’s caregivers completed a large paper and pencil survey on their home literacy habits in which they had to report the time spent weekly by their child reading during their leisure time.

1.2.2. Visuo-attentional abilities

Visual and oculomotor skills were assessed using the DEM-test (Developmental Eye Movement Test ; Garzia et al., 1990 ; Richman, 2009), which is composed of horizontal and vertical digit reading tasks printed on four different sheets of paper: the pre-test (a horizontal line of ten 0.5 cm high digits) to ensure that the child is familiar with all the digits presented, two vertical tests (Test A and B; each composed of two vertical lines of twenty 0.5 cm high digits separated by a 10.5 cm horizontal margin and with a 0.5 cm vertical distance between letters), which are supposed to evaluate the level of automaticity of the naming of numbers with involving only basic oculomotor skills, and one horizontal test (Test C; sixteen lines of five irregularly separated 0.5 cm high digits), that requires good oculomotor and visual-attentional skills, in addition to number naming skills, due to the variable lines spacing and numerous line references making the ocular tracking of lines more difficult. Children were asked to read aloud the digits as fast and as accurately as possible. The task was timed, and errors or omissions were recorded. The DEM test provides four main indices: (1) the vertical reading time (VT; in seconds), which represents the sum of the time spent on naming the eighty vertically organized digits of Test A and B (in accordance with the test manual, errors were not used for the scoring purpose); (2) the adjusted horizontal time (HT; in seconds) which represents the time required for reading the eighty horizontally organized digits presented in Test C; this score is corrected for omission or addition errors; (3) the total number of errors (HE) which gives the accuracy on the execution of Test C corresponding to the sum of the four possible types of errors: addition errors (i.e., adding or repeating a digit), omission errors (i.e., forgetting a digit), transposition errors (i.e., inversion between two digits) and substitution errors (i.e., replacing one digit with another); and (4) the Ration score (R) which is calculated dividing the HT by the VT. This ratio is intended to isolate the involvement of oculomotor processes in the horizontal naming task.

1.2.3. Reading Fluency

Reading scores (CTL indicator) were obtained using the Alouette test (Lefavrais, 1965), a standardized French reading level assessment. Participants must read 265 words as rapidly and accurately as possible within a 3-minute time limit. The text includes rare words, and some spelling traps, and the characteristics of this test prevent readers from compensating for their written word recognition difficulties via contextual guessing. We used the CTL score, which considers both correctness and speed (CTL = (C x180)/TL with C = the number of words correctly read, and TL = the reading time; max: 180 sec.).

1.2.4. Summary of data by grade.

sum_by_grade

sum_by_grade <- df %>%
  group_by(grade) %>%
  summarise(
    age_mean = mean(age, na.rm = TRUE),
    age_sd = sd(age, na.rm = TRUE),
    age_min = min(age, na.rm = TRUE),
    age_max = max(age, na.rm = TRUE),
    
    DEM_VT_mean = mean(DEM_VT, na.rm = TRUE),
    DEM_VT_sd = sd(DEM_VT, na.rm = TRUE),
    DEM_VT_min = min(DEM_VT, na.rm = TRUE),
    DEM_VT_max = max(DEM_VT, na.rm = TRUE),
    
    DEM_HT_mean = mean(DEM_HT, na.rm = TRUE),
    DEM_HT_sd = sd(DEM_HT, na.rm = TRUE),
    DEM_HT_min = min(DEM_HT, na.rm = TRUE),
    DEM_HT_max = max(DEM_HT, na.rm = TRUE),
    
    DEM_VT_mean = mean(DEM_VT, na.rm = TRUE),
    DEM_VT_sd = sd(DEM_VT, na.rm = TRUE),
    DEM_VT_min = min(DEM_VT, na.rm = TRUE),
    DEM_VT_max = max(DEM_VT, na.rm = TRUE),
    
    AL_CTL_mean = mean(AL_CTL, na.rm = TRUE),
    AL_CTL_sd = sd(AL_CTL, na.rm = TRUE),
    AL_CTL_min = min(AL_CTL, na.rm = TRUE),
    AL_CTL_max = max(AL_CTL, na.rm = TRUE),
    
    LPLA_P_time_mean = mean(LPLA_P_time, na.rm = TRUE),
    LPLA_P_time_sd = sd(LPLA_P_time, na.rm = TRUE),
    LPLA_P_time_min = min(LPLA_P_time, na.rm = TRUE),
    LPLA_P_time_max = max(LPLA_P_time, na.rm = TRUE)
  )

sum_by_grade %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

grade	age_mean	age_sd	age_min	age_max	DEM_VT_mean	DEM_VT_sd	DEM_VT_min	DEM_VT_max	DEM_HT_mean	DEM_HT_sd	DEM_HT_min	DEM_HT_max	AL_CTL_mean	AL_CTL_sd	AL_CTL_min	AL_CTL_max	LPLA_P_time_mean	LPLA_P_time_sd	LPLA_P_time_max
1	82.44118	4.061580	73	87	64.88853	22.390420	27.00	147.00	131.93090	55.29891	53.75758	291.20000	76.58824	42.02966	16	172.0000	99.55882	88.61192	420
2	93.86486	3.198630	88	99	52.20189	8.872080	39.00	72.00	83.85273	25.25050	52.26667	156.80000	172.25591	63.90618	80	372.7434	146.62162	149.33675	720
3	104.10000	3.220088	95	110	46.40800	6.678051	33.19	61.00	66.13164	12.95736	45.00000	105.60000	223.89502	81.80634	97	440.7477	147.63333	198.55278	900
4	115.96296	3.081976	110	122	42.41222	7.721486	28.77	58.00	58.91657	12.54279	40.00000	83.07429	283.11208	76.58591	124	463.3663	188.33333	194.88162	720
5	128.57143	4.322410	118	143	41.14314	7.400423	27.00	62.23	52.14564	11.51940	30.00000	87.50000	302.16995	92.99172	93	502.1053	179.78571	173.72191	900

2. Results

2.1. Correlation Matrix

cor

#selecting relevant variables
mat.df <- df[,-c(1,2,3,5,6,7)]

cor.mat <- rcorr(as.matrix(mat.df))

#compute matrices
mat <- cor.mat$r
pmat <- cor.mat$P

#Replace the NA of the diagonal
pmat[is.na(pmat)] <- 0

#display the matrix
corrplot(mat, method = "color", 
         addCoef.col = "black", number.cex = 0.7, tl.cex = 0.8, 
         tl.col = "black", cl.pos = "b", cl.cex = 0.8, type="lower",
         title = "Correlation Matrix with Significant Correlations Marked", mar = c(0, 0, 1, 0),
         p.mat = pmat, sig.level = 0.05, insig = "pch", pch.col = "black", pch.cex = 1.5)

2.2. Assessing multicollinearity

The Variance Inflation Factor (VIF) is a measure used to detect the presence and severity of multicollinearity in regression analysis. Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, which can cause issues with the interpretation and reliability of the model’s coefficients. The VIF index can be interpreted as follow :

VIF = 1: No correlation between the predictor variable and the other predictor variables.
1 < VIF < 5: Moderate correlation but generally considered acceptable
VIF > 5: High correlation, indicating that multicollinearity might be problematic. Some practitioners use a threshold of 10 to indicate serious multicollinearity.
VIF > 10: Strong multicollinearity, which could severely affect the estimation of regression coefficients.

No multicollinearity issues have been detected.

vif-values

# Fit a linear regression model
model.1 <- lm(AL_CTL ~ DEM_HT + DEM_VT + LPLA_P_time + age, data=df)

# Calculate VIF values
vif_values <- vif(model.1)

# Print VIF values
print(vif_values)

     DEM_HT      DEM_VT LPLA_P_time         age 
   2.285301    1.933917    1.029018    1.799007

2.3. Our conceptual model

The indirect effect of visual-attentional abilities (through HT and VT indices) was tested using Structured Equation Modelling (SEM) in the R Studio Environment. SEM concurrently models all paths, giving a more powerful, accurate, and robust estimation of mediation effects than more traditional tests based on sequential regressions, primarily when more than one mediator is implemented in the model. All of the relationships among variables in the model are tested together, and all paths can be compared with each other in terms of each variable’s degree of importance. Models were fit using the 5000 bootstrap technique. As no golden rule exists to assess model fit, reporting a variety of indexes is recommended to reflect different aspects of model fit.

First, we tested our conceptual model, which contains a mediation effect from leisure time reading to reading fluency through both measures of visual-attentional abilities (HT and VT indices). Moreover, as “age” was significantly correlated with reading fluency and HT and VT indices, we controlled these measures for the effect of age. Finally, as HT and VT indices shared (theoretically and empirically) common variance, we set a partial correlation between these two variables.

2.3.1. Paths graph

conceptual-model

model_diagram <- "
digraph SEM {
  node [shape = rectangle, style = filled, color = lightblue];

  # Variables
  LPLA_P_time [label = 'LPLA_P_time'];
  DEM_HT [label = 'DEM_HT'];
  DEM_VT [label = 'DEM_VT'];
  AL_CTL [label = 'AL_CTL'];

  # Relations directes
  LPLA_P_time -> AL_CTL [label = 'b1'];

  # Médiations
  LPLA_P_time -> DEM_HT [label = 'a1'];
  DEM_HT -> AL_CTL [label = 'b2'];
  LPLA_P_time -> DEM_VT [label = 'a2'];
  DEM_VT -> AL_CTL [label = 'b3'];
}
"

# Afficher le modèle avec DiagrammeR
DiagrammeR::grViz(model_diagram)

2.3.2. Implementation

The conceptual model was designed to explore the relationship between leisure reading time (LPLA_P_time), visual and attentional skills (HT and VT indices), and reading fluency (AL_CTL). Our initial model hypothesized direct and mediated effects of leisure reading time on reading fluency through visual and attentional skills.

As expected, we found the following direct effects :

Leisure reading time (LPLA_P_time) positively influenced reading fluency (AL_CTL) and DEM_HT. However Leisure reading time was not significantly related to DEM_VT.
Visual attentional skills (DEM_HT and DEM_VT) positively influenced reading fluency (AL_CTL).
Age significantly influenced DEM_HT, DEM_VT, and AL_CTL.

However, both mediation effects (though VT and HT) were non-significant.

Despite the fact that our conceptual model exhibited excellent fit indices, the examination of degrees of freedom indicate that our model overfit. Overfitting must be avoided because it leads to several issues, such as poor generalization to new data, increased model complexity, misleading inferences, wasted resources, and reduced model robustness, ultimately compromising the reliability and validity of the findings.

conceptual-model-implementation

# Define the model
model.2 <- '
  # Effet direct de LPLA sur CTL
  AL_CTL ~ b1*LPLA_P_time

  # Médiation de LPLA sur CTL par DEM_HT
  DEM_HT ~ a1*LPLA_P_time
  AL_CTL ~ b2*DEM_HT

  # Médiation de LPLA sur CTL par DEM_VT
  DEM_VT ~ a2*LPLA_P_time
  AL_CTL ~ b3*DEM_VT

  # Corrélation partielle entre DEM_HT et DEM_VT
  DEM_HT ~~ c3*DEM_VT
  
  #Direct effect of age on CTL, DEM_HT, and DEM_VT
  AL_CTL ~ d1*age
  DEM_HT ~ d2*age
  DEM_VT ~ d3*age

  # Calcul des effets indirects
  indirect_DEM_HT := a1*b2
  indirect_DEM_VT := a2*b3

  # Calcul des effets totaux
  total := b1 + (a1*b2) + (a2*b3)
'

# Ajustement du modèle aux données
fit.2 <- sem(model.2, data = df, se = "bootstrap", bootstrap = 5000)

# Résumé des résultats
summary(fit.2, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)

lavaan 0.6-18 ended normally after 24 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        12

  Number of observations                           198

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Model Test Baseline Model:

  Test statistic                               443.508
  Degrees of freedom                                 9
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2808.541
  Loglikelihood unrestricted model (H1)      -2808.541
                                                      
  Akaike (AIC)                                5641.082
  Bayesian (BIC)                              5680.541
  Sample-size adjusted Bayesian (SABIC)       5642.525

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.000
  P-value H_0: RMSEA <= 0.050                       NA
  P-value H_0: RMSEA >= 0.080                       NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000

Parameter Estimates:

  Standard errors                            Bootstrap
  Number of requested bootstrap draws             5000
  Number of successful bootstrap draws            5000

Regressions:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
  AL_CTL ~                                                                
    LPLA_P_tm (b1)     0.188    0.035    5.427    0.000     0.188    0.278
  DEM_HT ~                                                                
    LPLA_P_tm (a1)    -0.021    0.010   -2.243    0.025    -0.021   -0.091
  AL_CTL ~                                                                
    DEM_HT    (b2)    -0.591    0.185   -3.199    0.001    -0.591   -0.205
  DEM_VT ~                                                                
    LPLA_P_tm (a2)    -0.004    0.004   -0.928    0.353    -0.004   -0.046
  AL_CTL ~                                                                
    DEM_VT    (b3)    -1.902    0.649   -2.932    0.003    -1.902   -0.242
    age       (d1)     2.340    0.421    5.559    0.000     2.340    0.374
  DEM_HT ~                                                                
    age       (d2)    -1.372    0.155   -8.859    0.000    -1.372   -0.633
  DEM_VT ~                                                                
    age       (d3)    -0.442    0.062   -7.162    0.000    -0.442   -0.556

Covariances:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
 .DEM_HT ~~                                                               
   .DEM_VT    (c3)   172.478   54.459    3.167    0.002   172.478    0.492

Variances:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
   .AL_CTL          4516.097  468.532    9.639    0.000  4516.097    0.356
   .DEM_HT           877.995  190.819    4.601    0.000   877.995    0.578
   .DEM_VT           139.775   39.338    3.553    0.000   139.775    0.683

R-Square:
                   Estimate 
    AL_CTL             0.644
    DEM_HT             0.422
    DEM_VT             0.317

Defined Parameters:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
    indirct_DEM_HT     0.013    0.007    1.861    0.063     0.013    0.019
    indirct_DEM_VT     0.008    0.009    0.809    0.418     0.008    0.011
    total              0.208    0.038    5.533    0.000     0.208    0.308

conceptual-model-implementation

# Tracé du modèle
semPaths(fit.2, "std", layout = "tree", 
         whatLabels = "std", 
         edge.label.cex = 1, 
         node.label.cex = 1, 
         sizeMan = 10, 
         sizeLat = 10, 
         intercepts = FALSE)

Warning in qgraph::qgraph(Edgelist, labels = nLab, bidirectional = Bidir, : The
following arguments are not documented and likely not arguments of qgraph and
thus ignored: node.label.cex

2.4. Our revised model

As our primary theoretical interest is the potential mediation effect between leisure time reading and reading fluency through HT indices, we decided to remove the path from leisure time reading and DEM_VT, which was non-significant in the previous analysis. Thus, we solve the over-fitting issue.

We found the following direct effects :

Leisure reading time (LPLA_P_time) positively influenced reading fluency (AL_CTL) and DEM_HT.
Visual attentional skills (DEM_HT and DEM_VT) positively influenced reading fluency (AL_CTL).
Age significantly influenced DEM_HT, DEM_VT, and AL_CTL.

However, the mediation effect between leisure reading time and reading fluency through DEM_HT was still non-significant.

The revised model exhibited excellent fit indices, indicating a robust fit to the data while controlling for over-fitting issues. However, it should be noted that the upper bound of the RMSEA confidence interval can indicate potential misspecification if it exceeds acceptable thresholds (typically 0.08, as is the case here). An RMSEA of 0.000 with an upper confidence limit higher than 0.05 can signal that the model might not capture all relevant aspects of the data. Moreover, although the model explains a substantial portion of the variance in reading fluency, there remains an unexplained variance that might be accounted for by other factors not included in the model.

2.4.1. Paths graph

revised-model

model_diagram_revised <- "
digraph SEM {
  node [shape = rectangle, style = filled, color = lightblue];

  # Variables
  LPLA_P_time [label = 'LPLA_P_time'];
  DEM_HT [label = 'DEM_HT'];
  DEM_VT [label = 'DEM_VT'];
  AL_CTL [label = 'AL_CTL'];

  # Relations directes
  LPLA_P_time -> AL_CTL [label = 'b1'];
  DEM_VT -> AL_CTL [label = 'b3'];

  # Médiations
  LPLA_P_time -> DEM_HT [label = 'a1'];
  DEM_HT -> AL_CTL [label = 'b2'];
}
"

# Afficher le modèle avec DiagrammeR
DiagrammeR::grViz(model_diagram_revised)

2.4.2. Implementation

revised-model-implementation

# Define the model
model.3 <- '
  # Effet direct de LPLA sur CTL
  AL_CTL ~ b1*LPLA_P_time

  # Médiation de LPLA sur CTL par DEM_HT
  DEM_HT ~ a1*LPLA_P_time
  AL_CTL ~ b2*DEM_HT

  # Effet direct de DEM_VT sur CTL
  AL_CTL ~ b3*DEM_VT

  # Corrélation partielle entre DEM_HT et DEM_VT
  DEM_HT ~~ c3*DEM_VT
  
  #Direct effect of age on CTL, DEM_HT, and DEM_VT
  AL_CTL ~ d1*age
  DEM_HT ~ d2*age
  DEM_VT ~ d3*age

  # Calcul des effets indirects
  indirect_DEM_HT := a1*b2

  # Calcul des effets totaux
  total := b1 + (a1*b2)
'

# Ajustement du modèle aux données
fit.3 <- sem(model.3, data = df, se = "bootstrap", bootstrap = 5000)

# Résumé des résultats
summary(fit.3, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)

lavaan 0.6-18 ended normally after 22 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        11

  Number of observations                           198

Model Test User Model:
                                                      
  Test statistic                                 0.609
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.435

Model Test Baseline Model:

  Test statistic                               443.508
  Degrees of freedom                                 9
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.008

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2808.845
  Loglikelihood unrestricted model (H1)      -2808.541
                                                      
  Akaike (AIC)                                5639.690
  Bayesian (BIC)                              5675.861
  Sample-size adjusted Bayesian (SABIC)       5641.013

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.172
  P-value H_0: RMSEA <= 0.050                    0.538
  P-value H_0: RMSEA >= 0.080                    0.337

Standardized Root Mean Square Residual:

  SRMR                                           0.014

Parameter Estimates:

  Standard errors                            Bootstrap
  Number of requested bootstrap draws             5000
  Number of successful bootstrap draws            5000

Regressions:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
  AL_CTL ~                                                                
    LPLA_P_tm (b1)     0.188    0.035    5.397    0.000     0.188    0.279
  DEM_HT ~                                                                
    LPLA_P_tm (a1)    -0.016    0.008   -2.174    0.030    -0.016   -0.070
  AL_CTL ~                                                                
    DEM_HT    (b2)    -0.591    0.190   -3.109    0.002    -0.591   -0.205
    DEM_VT    (b3)    -1.902    0.642   -2.961    0.003    -1.902   -0.243
    age       (d1)     2.340    0.427    5.483    0.000     2.340    0.376
  DEM_HT ~                                                                
    age       (d2)    -1.377    0.160   -8.584    0.000    -1.377   -0.636
  DEM_VT ~                                                                
    age       (d3)    -0.447    0.063   -7.080    0.000    -0.447   -0.562

Covariances:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
 .DEM_HT ~~                                                               
   .DEM_VT    (c3)   173.009   55.308    3.128    0.002   173.009    0.493

Variances:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
   .AL_CTL          4516.097  473.462    9.538    0.000  4516.097    0.360
   .DEM_HT           878.651  196.357    4.475    0.000   878.651    0.580
   .DEM_VT           140.205   39.903    3.514    0.000   140.205    0.685

R-Square:
                   Estimate 
    AL_CTL             0.640
    DEM_HT             0.420
    DEM_VT             0.315

Defined Parameters:
                   Estimate   Std.Err  z-value  P(>|z|)   Std.lv   Std.all
    indirct_DEM_HT     0.010    0.006    1.732    0.083     0.010    0.014
    total              0.198    0.036    5.446    0.000     0.198    0.293

revised-model-implementation

# Tracé du modèle
semPaths(fit.3, "std", layout = "tree", 
         whatLabels = "std", 
         edge.label.cex = 1, 
         node.label.cex = 1, 
         sizeMan = 10, 
         sizeLat = 10, 
         intercepts = FALSE)

Warning in qgraph::qgraph(Edgelist, labels = nLab, bidirectional = Bidir, : The
following arguments are not documented and likely not arguments of qgraph and
thus ignored: node.label.cex

2.5. Comparison of the conceptual and the revised model

model-comparison

anova(fit.2, fit.3)


Chi-Squared Difference Test

      Df    AIC    BIC  Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fit.2  0 5641.1 5680.5 0.0000                                    
fit.3  1 5639.7 5675.9 0.6087    0.60873     0       1     0.4353