Background

A group of researchers at the University of Sheffield, self-named the “Practitioners”, investigate how people develop skills and knowledge throughout their lifetime. They are interested in understanding skill acquisition and focus on collecting and analysing data that can reveal the underlying factors that influence this process. In a set of studies, the Practitioners focused on the acquisition of language skills in toddlers. In particular, they investigated how language exposure impacts later linguistic skills, cognitive abilities, and academic achievement. In this knowledge assessment, you are part of the Practitioners team. The goal is to build several models that quantify and test the postulated theoretical assumptions. Previous studies of language acquisition showed that language skills depend on the richness of the environment as well as a number of other factors. In this exam, we will focus on the question of how people learn language and whether this influences other outcomes, such as university enrollment.

knitr::opts_chunk$set(echo = TRUE)
# Packages used throughout the report
library(tidyverse) # also loads ggplot2 and tidyr
library(tidySEM)   # graph_sem() path diagrams
library(lavaan)    # path models and CFA
library(statmod)
library(semPlot)   # semPaths() diagrams
# Checking summary of dataframes
summary(LexicalDec)
       ID          item_id             WordType     MotherVocab    
 1      :  50   1      : 100   cognates    :2500   Min.   : 63.49  
 2      :  50   2      : 100   non-cognates:2500   1st Qu.: 88.28  
 3      :  50   3      : 100                       Median :100.55  
 4      :  50   4      : 100                       Mean   :101.02  
 5      :  50   5      : 100                       3rd Qu.:111.80  
 6      :  50   6      : 100                       Max.   :149.71  
 (Other):4700   (Other):4400                                       
  FatherVocab      ReadingTime      ToddlerVocab          RT        
 Min.   : 55.43   Min.   : 48.24   Min.   : 26.14   Min.   : 324.7  
 1st Qu.: 88.65   1st Qu.: 87.44   1st Qu.: 53.11   1st Qu.:2396.7  
 Median : 99.82   Median :100.96   Median : 65.96   Median :2917.5  
 Mean   :101.67   Mean   :100.04   Mean   : 65.60   Mean   :2881.2  
 3rd Qu.:114.23   3rd Qu.:113.19   3rd Qu.: 76.42   3rd Qu.:3349.0  
 Max.   :151.56   Max.   :142.65   Max.   :109.20   Max.   :4587.6  
                                                                    
summary(Lexicon)
       ID        SES       MotherVocab      FatherVocab    
 1      :  1   Low :163   Min.   : 39.49   Min.   : 36.71  
 2      :  1   High:187   1st Qu.: 86.56   1st Qu.: 86.40  
 3      :  1              Median : 99.92   Median : 97.79  
 4      :  1              Mean   : 99.56   Mean   : 98.35  
 5      :  1              3rd Qu.:113.23   3rd Qu.:111.59  
 6      :  1              Max.   :149.71   Max.   :190.50  
 (Other):344                                               
  ReadingTime      ToddlerVocab      Enrollment   VerbalAbility   
 Min.   : 40.06   Min.   : 12.24   Min.   :0.00   Min.   :  2.80  
 1st Qu.: 83.25   1st Qu.: 52.63   1st Qu.:0.00   1st Qu.: 80.06  
 Median : 99.60   Median : 64.94   Median :0.00   Median :100.19  
 Mean   : 97.73   Mean   : 64.38   Mean   :0.44   Mean   :100.37  
 3rd Qu.:113.05   3rd Qu.: 75.90   3rd Qu.:1.00   3rd Qu.:119.54  
 Max.   :148.35   Max.   :112.19   Max.   :1.00   Max.   :197.73  
                                                                  
 ReadingComprehension  Orthography       Morphology    
 Min.   : 48.73       Min.   : 42.81   Min.   : 46.33  
 1st Qu.: 77.04       1st Qu.: 84.58   1st Qu.: 85.27  
 Median : 87.05       Median : 99.92   Median : 99.97  
 Mean   : 86.56       Mean   : 99.33   Mean   :100.03  
 3rd Qu.: 96.12       3rd Qu.:114.06   3rd Qu.:114.80  
 Max.   :126.68       Max.   :179.51   Max.   :170.14  
                                                       
   Phonetics          Syntax         Semantics        Pragmatics    
 Min.   : 42.03   Min.   : 38.06   Min.   : 50.87   Min.   : 40.47  
 1st Qu.: 87.09   1st Qu.: 84.12   1st Qu.: 86.59   1st Qu.: 86.91  
 Median :100.06   Median : 99.60   Median :100.64   Median :100.80  
 Mean   : 99.84   Mean   : 98.91   Mean   : 99.91   Mean   :100.15  
 3rd Qu.:111.78   3rd Qu.:112.86   3rd Qu.:112.08   3rd Qu.:113.22  
 Max.   :157.73   Max.   :156.90   Max.   :167.64   Max.   :153.01  
                                                                    

Study 1

In the first study, the Practitioners would like to re-analyse previously collected data. This data set contains measures of vocabulary size (an approximation of how many English words each toddler knows), the socioeconomic status of the family in which the toddler is growing up (high or low), the father's and mother's vocabulary sizes, and the time parents spend reading to the toddler. The Practitioners formed several hypotheses about how these variables relate to toddlers' vocabulary.

Question :

Specify and estimate a linear model that tests for the research questions postulated by the “Practitioners” team.
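Before fitting any models, it can help to look at the key relationship directly. A minimal exploratory sketch (assuming the Lexicon data frame and the packages loaded above):

# Exploratory plot: reading time vs. toddler vocabulary, split by SES
ggplot(Lexicon, aes(x = ReadingTime, y = ToddlerVocab, colour = SES)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Reading time", y = "Toddler vocabulary size")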

model1 <- lm(ToddlerVocab ~ MotherVocab, data = Lexicon)
summary(model1)

Call:
lm(formula = ToddlerVocab ~ MotherVocab, data = Lexicon)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.181  -9.943  -0.007  10.730  39.840 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28.76300    4.05976   7.085 7.77e-12 ***
MotherVocab  0.35776    0.03996   8.952  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.1 on 348 degrees of freedom
Multiple R-squared:  0.1872,    Adjusted R-squared:  0.1848 
F-statistic: 80.14 on 1 and 348 DF,  p-value: < 2.2e-16
model2 <- lm(ToddlerVocab ~ FatherVocab, data = Lexicon)
summary(model2)

Call:
lm(formula = ToddlerVocab ~ FatherVocab, data = Lexicon)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.784 -12.026   0.079  11.532  49.977 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.88488    4.40094  12.017  < 2e-16 ***
FatherVocab  0.11688    0.04383   2.667  0.00802 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.58 on 348 degrees of freedom
Multiple R-squared:  0.02002,   Adjusted R-squared:  0.01721 
F-statistic: 7.111 on 1 and 348 DF,  p-value: 0.008019
#First model
model3 <- lm(ToddlerVocab ~ MotherVocab*FatherVocab, data = Lexicon)
summary(model3)

Call:
lm(formula = ToddlerVocab ~ MotherVocab * FatherVocab, data = Lexicon)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.881  -9.992   0.194  10.493  39.954 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)   
(Intercept)             13.017364  18.723989   0.695  0.48738   
MotherVocab              0.484281   0.185924   2.605  0.00959 **
FatherVocab              0.170537   0.191124   0.892  0.37286   
MotherVocab:FatherVocab -0.001377   0.001859  -0.741  0.45921   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.12 on 346 degrees of freedom
Multiple R-squared:  0.1899,    Adjusted R-squared:  0.1829 
F-statistic: 27.04 on 3 and 346 DF,  p-value: 9.871e-16
#Second model
modelvoc <- lm(ToddlerVocab ~ ReadingTime, data = Lexicon)
summary(modelvoc)

Call:
lm(formula = ToddlerVocab ~ ReadingTime, data = Lexicon)

Residuals:
   Min     1Q Median     3Q    Max 
-54.76 -10.87  -0.28  10.28  53.79 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 44.23382    4.11002  10.762  < 2e-16 ***
ReadingTime  0.20615    0.04112   5.014 8.51e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.17 on 348 degrees of freedom
Multiple R-squared:  0.06737,   Adjusted R-squared:  0.06469 
F-statistic: 25.14 on 1 and 348 DF,  p-value: 8.51e-07
#Third model
model_ses <- lm(ToddlerVocab ~ ReadingTime:SES, data= Lexicon)
summary(model_ses)

Call:
lm(formula = ToddlerVocab ~ ReadingTime:SES, data = Lexicon)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.444 -10.152   0.214  10.867  57.182 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         43.89958    3.95364  11.104  < 2e-16 ***
ReadingTime:SESLow   0.16163    0.04040   4.001 7.71e-05 ***
ReadingTime:SESHigh  0.25166    0.04043   6.224 1.40e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.56 on 347 degrees of freedom
Multiple R-squared:  0.1397,    Adjusted R-squared:  0.1347 
F-statistic: 28.17 on 2 and 347 DF,  p-value: 4.606e-12
# Checking model differences
anova(modelvoc,model_ses)
Analysis of Variance Table

Model 1: ToddlerVocab ~ ReadingTime
Model 2: ToddlerVocab ~ ReadingTime:SES
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    348 91042                                  
2    347 83983  1    7058.7 29.165 1.236e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Checking model differences
anova(model3,modelvoc,model_ses)
Analysis of Variance Table

Model 1: ToddlerVocab ~ MotherVocab * FatherVocab
Model 2: ToddlerVocab ~ ReadingTime
Model 3: ToddlerVocab ~ ReadingTime:SES
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    346 79081                                  
2    348 91042 -2  -11960.6 26.165 2.619e-11 ***
3    347 83983  1    7058.7 30.883 5.479e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Final model
model_ses_final <- lm(ToddlerVocab ~ MotherVocab + FatherVocab + ReadingTime*SES,  data=Lexicon)
summary(model_ses_final)

Call:
lm(formula = ToddlerVocab ~ MotherVocab + FatherVocab + ReadingTime * 
    SES, data = Lexicon)

Residuals:
    Min      1Q  Median      3Q     Max 
-53.065  -9.140  -0.265  10.086  46.396 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         22.938659   6.429529   3.568 0.000411 ***
MotherVocab          0.319085   0.039544   8.069 1.19e-14 ***
FatherVocab          0.005765   0.039737   0.145 0.884726    
ReadingTime          0.051234   0.054690   0.937 0.349511    
SESHigh             -9.428745   7.291130  -1.293 0.196816    
ReadingTime:SESHigh  0.175596   0.072844   2.411 0.016452 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.26 on 344 degrees of freedom
Multiple R-squared:  0.283, Adjusted R-squared:  0.2726 
F-statistic: 27.16 on 5 and 344 DF,  p-value: < 2.2e-16

Plotting the model

plot(model_ses_final)

A linear model was suitable here because the hypotheses describe linear relationships between the predictors and the response. I took an iterative approach: fitting models of increasing complexity, comparing them, and keeping the terms that explain a significant amount of variance.
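As a compact complement to the pairwise anova() comparisons above, the candidate models can also be ranked on a single information criterion; a quick sketch using the models fitted above:

# Rank all candidate models by AIC (lower is better)
AIC(model1, model2, model3, modelvoc, model_ses, model_ses_final)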
Model 1
model3 <- lm(ToddlerVocab ~ MotherVocab*FatherVocab, data = Lexicon)
I included MotherVocab and FatherVocab because the hypothesis stated that toddlers with larger vocabularies will also have parents with stronger linguistic abilities. The main effects plus their interaction explain a significant amount of variance overall (F(3, 346) = 27.04, p < .001), although the interaction term itself is not significant.
Model 2
modelvoc <- lm(ToddlerVocab ~ ReadingTime, data = Lexicon)
As toddlers who are read to more may develop larger vocabularies, I checked the main effect of ReadingTime.
Model 3
model_ses <- lm(ToddlerVocab ~ ReadingTime:SES, data= Lexicon)
As the hypothesis stated that the effect of reading time will depend on the toddlers' SES level, I estimated a separate ReadingTime slope for each SES group. Here, the effect of reading time (ReadingTime) is expected to be moderated by socioeconomic status (SES), with toddlers in more affluent families benefiting more from reading time.
Final model
lm(ToddlerVocab ~ MotherVocab + FatherVocab + ReadingTime*SES, data=Lexicon)
Hence, I included the main effects of MotherVocab and FatherVocab, as well as the main and interaction effects of ReadingTime and SES, in line with the hypothesized relationships and the model comparisons above. The interaction effect lets me check whether the ReadingTime slope changes with the level of SES, which in turn tests the hypothesis that the relationship between ReadingTime and ToddlerVocab is stronger for high-SES toddlers.
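To make the interaction concrete, the SES-specific slopes of ReadingTime can be read off the fitted coefficients; a small sketch using model_ses_final from above:

# Simple slopes of ReadingTime implied by the interaction
b <- coef(model_ses_final)
b["ReadingTime"]                            # slope for low-SES toddlers (reference level), ~0.051
b["ReadingTime"] + b["ReadingTime:SESHigh"] # slope for high-SES toddlers, ~0.227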


Study 2

The Practitioners team collected additional information for the next three research questions. The toddlers from the original study are now young adults, and the Practitioners contacted them to collect additional information, such as measures of verbal ability (verbal intelligence), reading comprehension performance, and, finally, whether they enrolled at university. For the second research project, the Practitioners would like to focus on testing predictions from two theoretical frameworks that argue for nature or nurture mechanistic pathways in learning verbal behaviour. The Verbal nurturing theory states that practice is the most important factor: reading time influences vocabulary size, which in turn results in better reading comprehension outcomes (see Figure 1). The claims of the Verbal nurturing theory are illustrated by a direct pathway c, from Reading time to Reading comprehension, and indirect pathways a (from Reading time to Vocabulary size) and b (from Vocabulary size to Reading comprehension).
The Verbal nature theory assumes individual differences in people's verbal capabilities. It argues that verbal talent (Verbal ability) is the main factor that influences Vocabulary size and consequently results in better reading comprehension outcomes (see Figure 1). The claims of the Verbal nature theory are illustrated by a direct pathway e (from Verbal ability to Reading comprehension) and indirect pathways d (from Verbal ability to Vocabulary size) and b (from Vocabulary size to Reading comprehension). The Verbal nature theory also assumes that Verbal ability directly influences Reading time, but it does not predict any indirect effects via Reading time. This is illustrated as pathway k.
The Practitioners would like to specify a path model that tests the pathways proposed by the two theories. To do so, you need to specify two indirect effects, indirect effect 1 for the Verbal nurturing theory and indirect effect 2 for the Verbal nature theory, as well as two total effects, that is, combinations of the relevant direct and indirect effects.

Question :

Specify and estimate a path model that tests the proposed pathways postulated by the “Practitioners” team.

# Packages for the path model (already loaded in the setup chunk)
library(lavaan)
library(tidySEM)
path_model1 ="
ToddlerVocab ~ a*ReadingTime
ReadingComprehension ~ b*ToddlerVocab
ReadingComprehension ~ c*ReadingTime
ToddlerVocab ~ d*VerbalAbility
ReadingComprehension ~ e*VerbalAbility
ReadingTime ~ k*VerbalAbility
ind1 := a*b
tot1 := c+(a*b)
ind2 := d*b
tot2 := e+(d*b)
"
fit_path_model= sem(path_model1, data = Lexicon)
summary(fit_path_model, fit.measures=T, standardized=T)
lavaan 0.6-8 ended normally after 54 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         9
                                                      
  Number of observations                           350
                                                      
Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Model Test Baseline Model:

  Test statistic                              1888.649
  Degrees of freedom                                 6
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -3514.250
  Loglikelihood unrestricted model (H1)      -3514.250
                                                      
  Akaike (AIC)                                7046.500
  Bayesian (BIC)                              7081.221
  Sample-size adjusted Bayesian (BIC)         7052.670

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.000
  P-value RMSEA <= 0.05                             NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                         Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  ToddlerVocab ~                                                              
    ReadingTim (a)          0.209    0.041    5.092    0.000    0.209    0.263
  ReadingComprehension ~                                                      
    ToddlerVcb (b)          0.502    0.003  157.332    0.000    0.502    0.611
    ReadingTim (c)          0.297    0.003  116.957    0.000    0.297    0.455
  ToddlerVocab ~                                                              
    VerblAblty (d)          0.028    0.029    0.948    0.343    0.028    0.049
  ReadingComprehension ~                                                      
    VerblAblty (e)          0.249    0.002  143.143    0.000    0.249    0.538
  ReadingTime ~                                                               
    VerblAblty (k)         -0.054    0.038   -1.428    0.153   -0.054   -0.076

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ToddlerVocab    259.454   19.613   13.229    0.000  259.454    0.930
   .ReadngCmprhnsn    0.926    0.070   13.229    0.000    0.926    0.005
   .ReadingTime     439.607   33.231   13.229    0.000  439.607    0.994

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ind1              0.105    0.021    5.090    0.000    0.105    0.161
    tot1              0.402    0.021   19.369    0.000    0.402    0.616
    ind2              0.014    0.015    0.948    0.343    0.014    0.030
    tot2              0.263    0.015   17.865    0.000    0.263    0.568
#plotting path model
graph_sem(fit_path_model, variance_diameter=.2)

# Checking whether the two summary() calls return identical objects
identical(summary(fit_path_model),summary(fit_path_model, fit.measures=T, standardized=T))
lavaan 0.6-8 ended normally after 54 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         9
                                                      
  Number of observations                           350
                                                      
Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                         Estimate  Std.Err  z-value  P(>|z|)
  ToddlerVocab ~                                            
    ReadingTim (a)          0.209    0.041    5.092    0.000
  ReadingComprehension ~                                    
    ToddlerVcb (b)          0.502    0.003  157.332    0.000
    ReadingTim (c)          0.297    0.003  116.957    0.000
  ToddlerVocab ~                                            
    VerblAblty (d)          0.028    0.029    0.948    0.343
  ReadingComprehension ~                                    
    VerblAblty (e)          0.249    0.002  143.143    0.000
  ReadingTime ~                                             
    VerblAblty (k)         -0.054    0.038   -1.428    0.153

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .ToddlerVocab    259.454   19.613   13.229    0.000
   .ReadngCmprhnsn    0.926    0.070   13.229    0.000
   .ReadingTime     439.607   33.231   13.229    0.000

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)
    ind1              0.105    0.021    5.090    0.000
    tot1              0.402    0.021   19.369    0.000
    ind2              0.014    0.015    0.948    0.343
    tot2              0.263    0.015   17.865    0.000

[1] FALSE
# Making path model 2
path_model2 ="
ToddlerVocab ~ a*ReadingTime
ReadingComprehension ~ b*ToddlerVocab
ReadingComprehension ~ c*ReadingTime
ind1 := a*b
tot1 := c+(a*b)
"
fit_path_model2= sem(path_model2, data = Lexicon)
graph_sem(fit_path_model2, variance_diameter=.2)

# Checking summary of model 2
summary(fit_path_model2, fit.measures=T, standardized=T)
lavaan 0.6-8 ended normally after 16 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5
                                                      
  Number of observations                           350
                                                      
Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Model Test Baseline Model:

  Test statistic                               455.377
  Degrees of freedom                                 3
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2668.212
  Loglikelihood unrestricted model (H1)      -2668.212
                                                      
  Akaike (AIC)                                5346.425
  Bayesian (BIC)                              5365.714
  Sample-size adjusted Bayesian (BIC)         5349.853

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.000
  P-value RMSEA <= 0.05                             NA

Standardized Root Mean Square Residual:

  SRMR                                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                         Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  ToddlerVocab ~                                                              
    ReadingTim (a)          0.206    0.041    5.028    0.000    0.206    0.260
  ReadingComprehension ~                                                      
    ToddlerVcb (b)          0.526    0.025   21.355    0.000    0.526    0.639
    ReadingTim (c)          0.266    0.020   13.601    0.000    0.266    0.407

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ToddlerVocab    260.120   19.663   13.229    0.000  260.120    0.933
   .ReadngCmprhnsn   55.149    4.169   13.229    0.000   55.149    0.292

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ind1              0.108    0.022    4.895    0.000    0.108    0.166
    tot1              0.374    0.029   13.063    0.000    0.374    0.572
#install.packages("statmod")
library(statmod)

#install.packages("semPlot")
library(semPlot)

semPaths(fit_path_model2, 'model','est', edge.label.cex = 1.1)

# Creating path model 3
path_model3 ="
ToddlerVocab ~ a*ReadingTime
ReadingComprehension ~ b*ToddlerVocab
ReadingComprehension ~ c*ReadingTime
ToddlerVocab ~ 0*VerbalAbility
ReadingComprehension ~ e*VerbalAbility
ReadingTime ~ k*VerbalAbility
ind1 := a*b
tot1 := c+(a*b)
ind2 := 0*b
tot2 := e+(0*b)
"
fit_path_model3= sem(path_model3, data = Lexicon)
summary(fit_path_model3, fit.measures=T, standardized=T)
lavaan 0.6-8 ended normally after 48 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         8
                                                      
  Number of observations                           350
                                                      
Model Test User Model:
                                                      
  Test statistic                                 0.897
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.344

Model Test Baseline Model:

  Test statistic                              1888.649
  Degrees of freedom                                 6
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    1.000
  Tucker-Lewis Index (TLI)                       1.000

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -3514.698
  Loglikelihood unrestricted model (H1)      -3514.250
                                                      
  Akaike (AIC)                                7045.397
  Bayesian (BIC)                              7076.260
  Sample-size adjusted Bayesian (BIC)         7050.881

Root Mean Square Error of Approximation:

  RMSEA                                          0.000
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.138
  P-value RMSEA <= 0.05                          0.525

Standardized Root Mean Square Residual:

  SRMR                                           0.022

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                         Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  ToddlerVocab ~                                                              
    ReadingTim (a)          0.206    0.041    5.028    0.000    0.206    0.260
  ReadingComprehension ~                                                      
    ToddlerVcb (b)          0.502    0.003  157.533    0.000    0.502    0.621
    ReadingTim (c)          0.297    0.003  117.080    0.000    0.297    0.462
  ToddlerVocab ~                                                              
    VerblAblty              0.000                               0.000    0.000
  ReadingComprehension ~                                                      
    VerblAblty (e)          0.249    0.002  143.327    0.000    0.249    0.547
  ReadingTime ~                                                               
    VerblAblty (k)         -0.054    0.038   -1.428    0.153   -0.054   -0.076

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .ToddlerVocab    260.120   19.663   13.229    0.000  260.120    0.933
   .ReadngCmprhnsn    0.926    0.070   13.229    0.000    0.926    0.005
   .ReadingTime     439.607   33.231   13.229    0.000  439.607    0.994

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    ind1              0.104    0.021    5.026    0.000    0.104    0.161
    tot1              0.401    0.021   19.328    0.000    0.401    0.623
    ind2              0.000                               0.000    0.000
    tot2              0.249    0.002  143.327    0.000    0.249    0.547
semPaths(fit_path_model3, 'model','est', edge.label.cex = 1.1)

# Comparing models
require(semTools)
diff<-compareFit(fit_path_model,fit_path_model3)
summary(diff)
################### Nested Model Comparison #########################
Chi-Squared Difference Test

                Df    AIC    BIC  Chisq Chisq diff Df diff Pr(>Chisq)
fit_path_model   0 7046.5 7081.2 0.0000                              
fit_path_model3  1 7045.4 7076.3 0.8967    0.89668       1     0.3437

####################### Model Fit Indices ###########################

################## Differences in Fit Indices #######################
NA

I used a structural equation (path) model because it estimates the whole system of direct and indirect pathways simultaneously and provides significance tests for the defined indirect and total effects, which is exactly what the two theories require. Comparing the full model with the constrained model (pathway d fixed to 0) shows that dropping the Verbal ability to Vocabulary size path does not significantly worsen fit (chi-square difference = 0.897, df = 1, p = .344).
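The z-tests for the defined parameters (ind1, ind2, tot1, tot2) rely on the delta method; bootstrapped standard errors are a common robustness check for indirect effects. A sketch that refits the same model (estimates will vary slightly across resamples):

# Bootstrap standard errors and percentile CIs for the indirect/total effects
fit_boot <- sem(path_model1, data = Lexicon, se = "bootstrap", bootstrap = 1000)
parameterEstimates(fit_boot, boot.ci.type = "perc")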


Study 3

For the third research project, the Practitioners would like to focus on the information behind university enrollment. For each participant, they recorded whether they enrolled at university (1 – yes, 0 – no). The Practitioners would like to build a model that tests which variables are predictive of enrollment status. They aim to use the variables collected when the participants were toddlers, such as the mothers', fathers' and toddlers' vocabulary sizes, as well as reading time and socioeconomic status. If this type of model predicts distal outcomes such as university enrollment well, we can use this information to encourage the behaviours that lead to more positive outcomes (enrolling at university).

Question :

Specify the statistical model that tests which factors are predictive of University enrollment.
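A useful first look at the outcome (assuming the Lexicon data frame summarised above) is a simple cross-tabulation of enrollment against SES:

# Enrollment (0 = no, 1 = yes) by socioeconomic status
table(Lexicon$SES, Lexicon$Enrollment)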

# Removing FatherVocab outliers using the 1.5 * IQR rule
Q <- quantile(Lexicon$FatherVocab, probs = c(.25,.75), na.rm = FALSE)
iqr<- IQR(Lexicon$FatherVocab)
up<- Q[2]+1.5*iqr    # upper fence
down<- Q[1]-1.5*iqr  # lower fence

# Keep only cases whose FatherVocab lies inside the fences
Lexi_normal<- subset(Lexicon, Lexicon$FatherVocab > down & Lexicon$FatherVocab < up)
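A quick sanity check on the fences (the model output further below reports 345 observations, i.e. five cases removed from the original 350):

# How many cases did the outlier fences remove?
nrow(Lexicon) - nrow(Lexi_normal)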
# Define full and null models and do step procedure
model.null = glm(Enrollment ~ 1,
                 data=Lexi_normal,
                 family = binomial(link="logit")
)
model.full = glm(Enrollment~SES + FatherVocab + MotherVocab + ToddlerVocab + ReadingTime + VerbalAbility + ReadingComprehension + Orthography + Morphology + Phonetics + Syntax + Pragmatics,
                 data=Lexi_normal,
                 family = binomial(link="logit")
)
step(model.null,
     scope = list(upper=model.full),
     direction="both",
     test="Chisq",
     data=Lexi_normal)
Start:  AIC=474.9
Enrollment ~ 1

                       Df Deviance    AIC    LRT  Pr(>Chi)    
+ SES                   1   395.51 399.51 77.391 < 2.2e-16 ***
+ ToddlerVocab          1   436.68 440.68 36.215 1.767e-09 ***
+ ReadingComprehension  1   442.03 446.03 30.873 2.755e-08 ***
+ ReadingTime           1   453.12 457.12 19.779 8.691e-06 ***
+ MotherVocab           1   467.48 471.48  5.414   0.01998 *  
+ Pragmatics            1   469.90 473.90  2.996   0.08345 .  
<none>                      472.90 474.90                     
+ Syntax                1   471.10 475.10  1.796   0.18023    
+ FatherVocab           1   471.38 475.38  1.517   0.21811    
+ Morphology            1   471.46 475.46  1.437   0.23055    
+ Orthography           1   472.25 476.25  0.643   0.42255    
+ Phonetics             1   472.77 476.77  0.124   0.72424    
+ VerbalAbility         1   472.82 476.82  0.077   0.78205    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step:  AIC=399.51
Enrollment ~ SES

                       Df Deviance    AIC    LRT  Pr(>Chi)    
+ ReadingTime           1   368.04 374.04 27.468 1.597e-07 ***
+ ReadingComprehension  1   376.07 382.07 19.434 1.042e-05 ***
+ ToddlerVocab          1   376.88 382.88 18.625 1.591e-05 ***
+ MotherVocab           1   391.56 397.56  3.948   0.04694 *  
+ Pragmatics            1   393.50 399.50  2.010   0.15629    
<none>                      395.51 399.51                     
+ FatherVocab           1   393.56 399.56  1.945   0.16315    
+ VerbalAbility         1   394.40 400.40  1.111   0.29193    
+ Syntax                1   394.62 400.62  0.884   0.34697    
+ Morphology            1   394.87 400.87  0.633   0.42610    
+ Orthography           1   395.16 401.16  0.350   0.55407    
+ Phonetics             1   395.51 401.51  0.000   0.98489    
- SES                   1   472.90 474.90 77.391 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step:  AIC=374.04
Enrollment ~ SES + ReadingTime

                       Df Deviance    AIC    LRT  Pr(>Chi)    
+ ToddlerVocab          1   358.89 366.89  9.152  0.002484 ** 
+ ReadingComprehension  1   365.35 373.35  2.684  0.101338    
<none>                      368.04 374.04                     
+ Pragmatics            1   366.79 374.79  1.250  0.263617    
+ MotherVocab           1   366.82 374.82  1.222  0.269032    
+ Syntax                1   367.11 375.11  0.925  0.336157    
+ VerbalAbility         1   367.36 375.36  0.675  0.411443    
+ Morphology            1   367.60 375.60  0.442  0.506382    
+ Orthography           1   367.62 375.62  0.420  0.517016    
+ FatherVocab           1   367.92 375.92  0.119  0.730168    
+ Phonetics             1   368.03 376.03  0.009  0.925965    
- ReadingTime           1   395.51 399.51 27.468 1.597e-07 ***
- SES                   1   453.12 457.12 85.080 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step:  AIC=366.89
Enrollment ~ SES + ReadingTime + ToddlerVocab

                       Df Deviance    AIC    LRT  Pr(>Chi)    
<none>                      358.89 366.89                     
+ VerbalAbility         1   358.09 368.09  0.794  0.373008    
+ Syntax                1   358.15 368.15  0.734  0.391505    
+ ReadingComprehension  1   358.17 368.17  0.715  0.397836    
+ Pragmatics            1   358.27 368.27  0.618  0.431884    
+ Morphology            1   358.41 368.41  0.480  0.488382    
+ Orthography           1   358.64 368.64  0.250  0.616983    
+ Phonetics             1   358.87 368.87  0.020  0.888581    
+ MotherVocab           1   358.87 368.87  0.013  0.908738    
+ FatherVocab           1   358.89 368.89  0.001  0.972086    
- ToddlerVocab          1   368.04 374.04  9.152  0.002484 ** 
- ReadingTime           1   376.88 382.88 17.996 2.214e-05 ***
- SES                   1   427.16 433.16 68.277 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Call:  glm(formula = Enrollment ~ SES + ReadingTime + ToddlerVocab, 
    family = binomial(link = "logit"), data = Lexi_normal)

Coefficients:
 (Intercept)       SESHigh   ReadingTime  ToddlerVocab  
    -5.83233       2.12945       0.02784       0.02520  

Degrees of Freedom: 344 Total (i.e. Null);  341 Residual
Null Deviance:      472.9 
Residual Deviance: 358.9    AIC: 366.9

Final model

#Final model
model.final =glm(formula = Enrollment ~ SES + ReadingTime + ToddlerVocab, 
                 family = binomial(link = "logit"), data = Lexi_normal)

summary(model.final)

Call:
glm(formula = Enrollment ~ SES + ReadingTime + ToddlerVocab, 
    family = binomial(link = "logit"), data = Lexi_normal)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0935  -0.7605  -0.3821   0.8825   2.3950  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -5.832333   0.840261  -6.941 3.89e-12 ***
SESHigh       2.129449   0.279534   7.618 2.58e-14 ***
ReadingTime   0.027840   0.006848   4.065 4.80e-05 ***
ToddlerVocab  0.025196   0.008460   2.978   0.0029 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 472.90  on 344  degrees of freedom
Residual deviance: 358.89  on 341  degrees of freedom
AIC: 366.89

Number of Fisher Scoring iterations: 4
#Checking for correlation
df_glm <- Lexi_normal[,c("ToddlerVocab", "ReadingTime" )]
library("PerformanceAnalytics")
chart.Correlation(df_glm, histogram=TRUE, pch=19)

#Analysis of variance for individual terms
#install.packages("car")
library(car)
Anova(model.final, type="II", test="Wald")
Analysis of Deviance Table (Type II tests)

Response: Enrollment
             Df   Chisq Pr(>Chisq)    
SES           1 58.0316  2.579e-14 ***
ReadingTime   1 16.5272  4.796e-05 ***
ToddlerVocab  1  8.8697   0.002899 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Pseudo-R-squared
#install.packages("rcompanion")
library(rcompanion)
nagelkerke(model.final)
$Models
                                                                                                    
Model: "glm, Enrollment ~ SES + ReadingTime + ToddlerVocab, binomial(link = \"logit\"), Lexi_normal"
Null:  "glm, Enrollment ~ 1, binomial(link = \"logit\"), Lexi_normal"                               

$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.241091
Cox and Snell (ML)                   0.281413
Nagelkerke (Cragg and Uhler)         0.377191

$Likelihood.ratio.test
 Df.diff LogLik.diff  Chisq    p.value
      -3     -57.006 114.01 1.5026e-24

$Number.of.observations
          
Model: 345
Null:  345

$Messages
[1] "Note: For models fit with REML, these statistics are based on refitting with ML"

$Warnings
[1] "None"
#odds ratio
#install.packages("broom")
library(broom)
tidy(model.final, exponentiate = T, conf.int = T)
NA

The odds ratios were all > 1, so high SES, more reading time, and a larger toddler vocabulary each increase the odds of enrollment. The confidence intervals for SES, ReadingTime and ToddlerVocab do not include 1.0, so these effects are statistically significant.
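The same odds ratios can also be obtained directly from the fitted model, which is what tidy(exponentiate = T) does internally:

# Odds ratios: exponentiated logistic regression coefficients
exp(coef(model.final))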

# Predicted probability of enrollment for each participant
Lexi_normal$probability = predict(model.final,newdata = NULL, type = 'response')
library(psych) # pairs.panels() comes from psych
pairs.panels(Lexi_normal)

# Converting probabilities to predicted classes with a 0.5 cut-off
Lexi_normal$probability <- ifelse(Lexi_normal$probability>0.5,1,0)
# making a confusion matrix (rows: observed Enrollment, columns: predicted class)
cm<-table(col=Lexi_normal$Enrollment,Lexi_normal$probability)
print(cm)
   
col   0   1
  0 150  44
  1  45 106
# Overall accuracy: proportion of correctly classified cases
sum(diag(cm))/sum(colSums(cm))
[1] 0.742029
#install.packages("caret")
library(caret)

# checking precision n f1-score
confusionMatrix(cm, mode = "prec_recall")
Confusion Matrix and Statistics

   
col   0   1
  0 150  44
  1  45 106
                                          
               Accuracy : 0.742           
                 95% CI : (0.6925, 0.7874)
    No Information Rate : 0.5652          
    P-Value [Acc > NIR] : 7.278e-12       
                                          
                  Kappa : 0.4755          
                                          
 Mcnemar's Test P-Value : 1               
                                          
              Precision : 0.7732          
                 Recall : 0.7692          
                     F1 : 0.7712          
             Prevalence : 0.5652          
         Detection Rate : 0.4348          
   Detection Prevalence : 0.5623          
      Balanced Accuracy : 0.7379          
                                          
       'Positive' Class : 0               
                                          
#confidence intervals
exp(confint(model.final))
Waiting for profiling to be done...
                    2.5 %      97.5 %
(Intercept)  0.0005252926  0.01427845
SESHigh      4.9306587127 14.79015580
ReadingTime  1.0148298671  1.04251953
ToddlerVocab 1.0088537972  1.04300408

I used a generalized linear model with a binomial family (logistic regression) because the dependent variable (Enrollment) is binary; unlike linear regression, this makes no normality assumption about the outcome. I used a stepwise procedure (the step function), which adds and drops terms to minimise AIC, to arrive at this specification. As the retained numeric predictors are only weakly correlated (r = 0.26), the order in which terms enter does not change the result.
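Multicollinearity among the retained predictors can additionally be checked with variance inflation factors; a quick sketch using the car package loaded above:

# Variance inflation factors for the final logistic model
car::vif(model.final)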


Study 4

In their previous work, the Practitioners developed a questionnaire to investigate levels of language expertise. The questionnaire collects six measures, each focusing on a specific level of language processing: orthography, phonetics, morphology, syntax, semantics, and pragmatics. When investigating the latent structure of this questionnaire, the Practitioners proposed a two-factor solution. The first factor was described as Rule automatization, with orthography, phonetics and morphology loading highly on it. The second factor was described as Fine expertise, with knowledge of syntax, semantics, and pragmatics loading highly on it.
For this research project, the Practitioners used our participants to resample indices of language expertise; in other words, they collected additional measures of language performance using their questionnaire. In the fourth study, they aimed to validate the proposed two-factor solution for this questionnaire and planned to run a confirmatory factor analysis on these data. Can you help them with this goal?

Question :

Specify and estimate a confirmatory factor model that tests the proposed latent structure. NOTE: use scaling of the variance to 1 for the latent factor structure.
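Before imposing the confirmatory structure, an exploratory two-factor solution can serve as a sanity check that the six indicators actually separate into the two proposed clusters; a sketch using the psych package (also loaded below):

# Exploratory factor analysis: do the six measures recover two factors?
library(psych)
fa(Lexicon[, c("Orthography", "Phonetics", "Morphology",
               "Syntax", "Semantics", "Pragmatics")],
   nfactors = 2, rotate = "varimax")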

library(lavaan)
library(psych)
scatter.hist(x=Lexicon$Orthography, y=Lexicon$Semantics, density=T, ellipse=T)


fa1 ='
# free the first loadings (NA*) and identify the model by fixing
# each latent variance to 1 instead
RuleAutomization=~ NA*Orthography + Phonetics + Morphology
FineExpertise=~ NA*Syntax + Semantics + Pragmatics
RuleAutomization ~~ 1*RuleAutomization
FineExpertise ~~ 1*FineExpertise
'

cfa1 = cfa(fa1, data=Lexicon)

summary(cfa1, standardize=TRUE, fit.measures=TRUE)
lavaan 0.6-8 ended normally after 23 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        13
                                                      
  Number of observations                           350
                                                      
Model Test User Model:
                                                      
  Test statistic                                 9.113
  Degrees of freedom                                 8
  P-value (Chi-square)                           0.333

Model Test Baseline Model:

  Test statistic                               734.539
  Degrees of freedom                                15
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.998
  Tucker-Lewis Index (TLI)                       0.997

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -8975.372
  Loglikelihood unrestricted model (H1)      -8970.815
                                                      
  Akaike (AIC)                               17976.743
  Bayesian (BIC)                             18026.896
  Sample-size adjusted Bayesian (BIC)        17985.656

Root Mean Square Error of Approximation:

  RMSEA                                          0.020
  90 Percent confidence interval - lower         0.000
  90 Percent confidence interval - upper         0.068
  P-value RMSEA <= 0.05                          0.810

Standardized Root Mean Square Residual:

  SRMR                                           0.034

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                      Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  RuleAutomization =~                                                      
    Orthography         17.720    1.081   16.387    0.000   17.720    0.870
    Phonetics           11.616    1.041   11.163    0.000   11.616    0.595
    Morphology          16.438    1.120   14.680    0.000   16.438    0.781
  FineExpertise =~                                                         
    Syntax              16.953    1.099   15.429    0.000   16.953    0.777
    Semantics           17.099    0.992   17.238    0.000   17.099    0.856
    Pragmatics          15.039    1.084   13.867    0.000   15.039    0.706

Covariances:
                      Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  RuleAutomization ~~                                                      
    FineExpertise        0.080    0.063    1.267    0.205    0.080    0.080

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    RuleAutomizatn    1.000                               1.000    1.000
    FineExpertise     1.000                               1.000    1.000
   .Orthography     100.384   24.546    4.090    0.000  100.384    0.242
   .Phonetics       245.826   21.133   11.632    0.000  245.826    0.646
   .Morphology      172.963   23.975    7.214    0.000  172.963    0.390
   .Syntax          188.499   22.357    8.431    0.000  188.499    0.396
   .Semantics       106.331   19.273    5.517    0.000  106.331    0.267
   .Pragmatics      227.307   21.895   10.382    0.000  227.307    0.501
lavPredict(cfa1) 
       RlAtmz FnExpr
  [1,]  0.179  0.438
  [2,]  0.177 -0.160
  [3,] -0.450  0.897
  [4,] -0.372 -1.784
  [5,]  0.292  0.270
  [6,] -0.812 -0.567
  [7,] -0.008 -2.203
  [8,] -2.216 -0.080
  [9,] -1.113  0.842
 [10,] -0.273 -1.086
 [11,] -0.003 -1.314
 [12,]  0.482 -0.929
 [13,]  0.330 -0.973
 [14,]  0.006  0.653
 [15,] -0.220 -1.184
 [16,]  2.136  0.801
 [17,]  1.248 -0.691
 [18,] -0.585  0.129
 [19,] -0.920  1.601
 [20,] -0.255 -0.739
 [21,]  0.721 -0.399
 [22,] -0.310 -0.478
 [23,]  0.398 -0.743
 [24,] -0.543 -1.116
 [25,]  0.617  0.268
 [26,]  0.167 -0.139
 [27,] -0.406  0.320
 [28,] -0.076  0.619
 [29,] -0.804 -0.055
 [30,]  0.624 -0.125
 [31,] -0.122 -0.286
 [32,]  0.686  0.014
 [33,] -0.635  1.134
 [34,] -0.760 -0.782
 [35,]  0.321  1.075
 [36,]  1.447  0.229
 [37,] -0.388 -1.004
 [38,] -1.758  0.735
 [39,]  1.017  0.357
 [40,]  1.836  0.459
 [41,] -1.992 -0.367
 [42,] -0.344 -0.512
 [43,] -0.306  1.230
 [44,] -0.796 -0.562
 [45,] -0.099 -1.449
 [46,]  0.703  0.864
 [47,]  0.524  1.342
 [48,]  0.813 -0.020
 [49,] -0.111 -1.452
 [50,] -1.066 -0.268
 [51,]  0.377 -0.415
 [52,]  0.176 -0.705
 [53,] -1.692 -0.882
 [54,] -0.705  0.513
 [55,]  0.561  1.210
 [56,]  0.814  1.049
 [57,]  0.431 -0.211
 [58,]  0.224 -0.521
 [59,]  1.177  0.004
 [60,]  0.236 -1.278
 [61,] -0.120  2.295
 [62,]  0.724 -0.023
 [63,]  0.689 -1.383
 [64,]  0.673  0.078
 [65,] -0.359 -0.045
 [66,]  0.330 -0.837
 [67,] -0.135 -0.845
  ...    [rows 68–350 of the lavPredict(cfa1) output omitted: a 350 × 2 matrix of predicted factor scores on RuleAutomization and FineExpertise]
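If the factor scores are needed downstream, they can be bound onto the data; a minimal sketch, assuming complete cases so the rows of the score matrix align with the rows of Lexicon (the two new column names are illustrative):

# Sketch: attach the CFA factor scores to Lexicon for later use
scores <- lavPredict(cfa1)                                    # columns are named after the latents
Lexicon$RuleAutomizationScore <- scores[, "RuleAutomization"]
Lexicon$FineExpertiseScore    <- scores[, "FineExpertise"]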
# predict() with type = "response" is glm syntax and does not apply to a lavaan fit;
# factor scores come from lavPredict(), as sketched above.

# Metric invariance model: multi-group CFA across SES with loadings constrained equal
fitMIM = cfa(fa1, data=Lexicon, group = 'SES', group.equal='loadings')
summary(fitMIM)
lavaan 0.6-8 ended normally after 32 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        38
  Number of equality constraints                     6
                                                      
  Number of observations per group:                   
    Low                                            163
    High                                           187
                                                      
Model Test User Model:
                                                      
  Test statistic                                21.783
  Degrees of freedom                                22
  P-value (Chi-square)                           0.473
  Test statistic for each group:
    Low                                          9.891
    High                                        11.892

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured


Group 1 [Low]:

Latent Variables:
                      Estimate  Std.Err  z-value  P(>|z|)
  RuleAutomization =~                                    
    Orthgrp (.p1.)      17.759    1.081   16.433    0.000
    Phontcs (.p2.)      11.595    1.038   11.169    0.000
    Mrphlgy (.p3.)      16.412    1.117   14.698    0.000
  FineExpertise =~                                       
    Syntax  (.p4.)      16.845    1.096   15.373    0.000
    Semntcs (.p5.)      17.095    0.992   17.235    0.000
    Prgmtcs (.p6.)      14.951    1.084   13.798    0.000

Covariances:
                      Estimate  Std.Err  z-value  P(>|z|)
  RuleAutomization ~~                                    
    FineExpertise        0.004    0.093    0.048    0.962

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .Orthography     100.030    1.579   63.348    0.000
   .Phonetics       100.922    1.540   65.526    0.000
   .Morphology      101.343    1.659   61.084    0.000
   .Syntax          100.414    1.676   59.930    0.000
   .Semantics       100.367    1.579   63.576    0.000
   .Pragmatics      101.420    1.650   61.475    0.000
    RuleAutomizatn    0.000                           
    FineExpertise     0.000                           

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    RuleAutomizatn    1.000                           
    FineExpertise     1.000                           
   .Orthography      91.050   29.789    3.056    0.002
   .Phonetics       252.226   31.066    8.119    0.000
   .Morphology      179.310   31.505    5.691    0.000
   .Syntax          173.836   29.129    5.968    0.000
   .Semantics       113.997   25.388    4.490    0.000
   .Pragmatics      220.105   30.278    7.270    0.000


Group 2 [High]:

Latent Variables:
                      Estimate  Std.Err  z-value  P(>|z|)
  RuleAutomization =~                                    
    Orthgrp (.p1.)      17.759    1.081   16.433    0.000
    Phontcs (.p2.)      11.595    1.038   11.169    0.000
    Mrphlgy (.p3.)      16.412    1.117   14.698    0.000
  FineExpertise =~                                       
    Syntax  (.p4.)      16.845    1.096   15.373    0.000
    Semntcs (.p5.)      17.095    0.992   17.235    0.000
    Prgmtcs (.p6.)      14.951    1.084   13.798    0.000

Covariances:
                      Estimate  Std.Err  z-value  P(>|z|)
  RuleAutomization ~~                                    
    FineExpertise        0.143    0.086    1.670    0.095

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .Orthography      98.718    1.504   65.637    0.000
   .Phonetics        98.901    1.413   70.004    0.000
   .Morphology       98.885    1.528   64.731    0.000
   .Syntax           97.602    1.611   60.574    0.000
   .Semantics        99.508    1.444   68.897    0.000
   .Pragmatics       99.042    1.566   63.234    0.000
    RuleAutomizatn    0.000                           
    FineExpertise     0.000                           

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    RuleAutomizatn    1.000                           
    FineExpertise     1.000                           
   .Orthography     107.629   29.019    3.709    0.000
   .Phonetics       238.810   27.792    8.593    0.000
   .Morphology      167.044   28.954    5.769    0.000
   .Syntax          201.741   29.807    6.768    0.000
   .Semantics        97.838   23.595    4.147    0.000
   .Pragmatics      235.204   29.901    7.866    0.000
# Configural model: the same CFA with loadings estimated freely in each SES group
fitMIC <- cfa(fa1, data = Lexicon, group = 'SES') 
summary(fitMIC)
lavaan 0.6-8 ended normally after 36 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        38
                                                      
  Number of observations per group:                   
    Low                                            163
    High                                           187
                                                      
Model Test User Model:
                                                      
  Test statistic                                13.914
  Degrees of freedom                                16
  P-value (Chi-square)                           0.605
  Test statistic for each group:
    Low                                          5.743
    High                                         8.171

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured


Group 1 [Low]:

Latent Variables:
                      Estimate  Std.Err  z-value  P(>|z|)
  RuleAutomization =~                                    
    Orthography         19.872    1.616   12.298    0.000
    Phonetics           12.066    1.561    7.728    0.000
    Morphology          15.645    1.643    9.525    0.000
  FineExpertise =~                                       
    Syntax              15.403    1.563    9.856    0.000
    Semantics           18.165    1.511   12.023    0.000
    Pragmatics          13.920    1.562    8.914    0.000

Covariances:
                      Estimate  Std.Err  z-value  P(>|z|)
  RuleAutomization ~~                                    
    FineExpertise        0.027    0.089    0.307    0.758

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .Orthography     100.030    1.650   60.626    0.000
   .Phonetics       100.922    1.574   64.109    0.000
   .Morphology      101.343    1.663   60.926    0.000
   .Syntax          100.414    1.623   61.869    0.000
   .Semantics       100.367    1.597   62.860    0.000
   .Pragmatics      101.420    1.611   62.959    0.000
    RuleAutomizatn    0.000                           
    FineExpertise     0.000                           

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    RuleAutomizatn    1.000                           
    FineExpertise     1.000                           
   .Orthography      48.847   42.036    1.162    0.245
   .Phonetics       258.364   32.486    7.953    0.000
   .Morphology      206.223   34.489    5.979    0.000
   .Syntax          192.101   31.012    6.194    0.000
   .Semantics        85.586   32.773    2.611    0.009
   .Pragmatics      229.227   31.372    7.307    0.000


Group 2 [High]:

Latent Variables:
                      Estimate  Std.Err  z-value  P(>|z|)
  RuleAutomization =~                                    
    Orthography         15.840    1.457   10.874    0.000
    Phonetics           11.065    1.396    7.928    0.000
    Morphology          17.226    1.541   11.180    0.000
  FineExpertise =~                                       
    Syntax              18.148    1.534   11.833    0.000
    Semantics           16.385    1.317   12.446    0.000
    Pragmatics          15.852    1.502   10.551    0.000

Covariances:
                      Estimate  Std.Err  z-value  P(>|z|)
  RuleAutomization ~~                                    
    FineExpertise        0.132    0.087    1.521    0.128

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .Orthography      98.718    1.440   68.536    0.000
   .Phonetics        98.901    1.385   71.416    0.000
   .Morphology       98.885    1.523   64.941    0.000
   .Syntax           97.602    1.656   58.941    0.000
   .Semantics        99.508    1.432   69.472    0.000
   .Pragmatics       99.042    1.598   61.986    0.000
    RuleAutomizatn    0.000                           
    FineExpertise     0.000                           

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    RuleAutomizatn    1.000                           
    FineExpertise     1.000                           
   .Orthography     137.060   30.360    4.514    0.000
   .Phonetics       236.201   27.782    8.502    0.000
   .Morphology      136.837   34.755    3.937    0.000
   .Syntax          183.432   31.722    5.782    0.000
   .Semantics       115.175   23.894    4.820    0.000
   .Pragmatics      226.120   30.414    7.435    0.000
library(semTools)
summary(compareFit(fitMIM,fitMIC))
################### Nested Model Comparison #########################
Chi-Squared Difference Test

       Df   AIC   BIC  Chisq Chisq diff Df diff Pr(>Chisq)
fitMIC 16 18013 18159 13.914                              
fitMIM 22 18009 18132 21.783     7.8691       6     0.2478

####################### Model Fit Indices ###########################

################## Differences in Fit Indices #######################
NA
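The chi-squared difference test is non-significant (Δχ² = 7.87, Δdf = 6, p = .248), so constraining the loadings to equality across SES groups does not worsen fit: metric (weak) invariance is supported. The fit-index section above prints NA; if those indices are wanted, they can be requested directly from lavaan with fitMeasures() (a minimal sketch; the index names are lavaan's standard ones):

# Sketch: common fit indices for the constrained (fitMIM) and free (fitMIC) models
round(rbind(
  metric     = fitMeasures(fitMIM, c("cfi", "tli", "rmsea", "srmr", "aic", "bic")),
  configural = fitMeasures(fitMIC, c("cfi", "tli", "rmsea", "srmr", "aic", "bic"))
), 3)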

Study 5

In the last step of data collection, Practitioners created an experimental study to investigate the cognitive processing of different words. In particular, they focused on the processing of cognates versus non-cognates. Cognates are words that share a common etymological origin with words in other languages (e.g., English: starve and German: sterben). Practitioners randomly chose 100 participants from their original sample and presented them with a lexical decision task. In this task, participants are presented with a string of letters and asked to indicate, as quickly and as accurately as possible, whether it is a real word (e.g., starve) or a non-word (e.g., snepd). Each participant was presented with 100 strings, half of which were words. Importantly, Practitioners manipulated cognate status within those 50 words: 25 were cognates and 25 were non-cognates. For this analysis, Practitioners would like you to focus only on the speed of processing of the real words, measured as reaction time in milliseconds. Smaller reaction times indicate faster processing, which would be consistent with a privileged cognitive status for cognates. Can you analyse the experimental data and tell Practitioners whether cognates elicit faster processing times than non-cognate words?
In the second dataset (LexicalDec), you can find all the collected information. Column ID indexes repeated measurements for each subject, item_id indexes repeated measurements for each item, WordType shows whether a word was a cognate or a non-cognate, and RT (reaction time) shows how quickly participants decided whether the presented string of letters was a word or a non-word. Practitioners also added other information to this dataset, such as the mother's and father's vocabulary sizes, time spent reading, and estimated vocabulary size when our participants were toddlers. Using these data, build a model that accounts for the repeated measurements (clustered observations) and tests the effect of word type. You can also use the additional measures to control for the effects of vocabulary size and time spent reading.

Question:

Specify the statistical model that tests the effect of type of words on reaction time.

library(lme4)
#install.packages("lmerTest")
library(lmerTest)
#install.packages("sjPlot")
require(sjPlot)
#install.packages("glmmTMB")
library(glmmTMB)
#install.packages("merTools")
library('merTools')
#install.packages("jtools")
library(jtools)

# Spaghetti plot: RT for each item, one line per participant
ggplot(data = LexicalDec, aes(item_id, RT, group=ID))+geom_line()+geom_point()

# Null model
m_0 = lmer(RT~1 + (1|ID) + (1|item_id), REML= FALSE, data=LexicalDec)
summary(m_0) #checking summary
Linear mixed model fit by maximum likelihood . t-tests use
  Satterthwaite's method [lmerModLmerTest]
Formula: RT ~ 1 + (1 | ID) + (1 | item_id)
   Data: LexicalDec

     AIC      BIC   logLik deviance df.resid 
 67966.1  67992.2 -33979.1  67958.1     4996 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.7334 -0.6734 -0.0033  0.6576  3.4361 

Random effects:
 Groups   Name        Variance Std.Dev.
 ID       (Intercept) 403733   635.40  
 item_id  (Intercept)   5531    74.37  
 Residual              40255   200.64  
Number of obs: 5000, groups:  ID, 100; item_id, 50

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)  2881.20      64.47  105.36   44.69   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
confint(m_0) # profile CIs: .sig01 = ID intercept SD, .sig02 = item_id intercept SD, .sigma = residual SD
Computing profile confidence intervals ...
                2.5 %    97.5 %
.sig01       556.3987  734.9580
.sig02        60.7480   93.0611
.sigma       196.7086  204.6952
(Intercept) 2753.6867 3008.7185
summ(m_0)    #checking lmer summary
MODEL INFO:
Observations: 5000
Dependent Variable: RT
Type: Mixed effects linear regression 

MODEL FIT:
AIC = 67966.11, BIC = 67992.18
Pseudo-R² (fixed effects) = 0.00
Pseudo-R² (total) = 0.91 

FIXED EFFECTS:
------------------------------------------------------------
                       Est.    S.E.   t val.     d.f.      p
----------------- --------- ------- -------- -------- ------
(Intercept)         2881.20   64.47    44.69   105.36   0.00
------------------------------------------------------------

p values calculated using Satterthwaite d.f.

RANDOM EFFECTS:
------------------------------------
  Group      Parameter    Std. Dev. 
---------- ------------- -----------
    ID      (Intercept)    635.40   
 item_id    (Intercept)     74.37   
 Residual                  200.64   
------------------------------------

Grouping variables:
---------------------------
  Group    # groups   ICC  
--------- ---------- ------
   ID        100      0.90 
 item_id      50      0.01 
---------------------------
ranova(m_0)  #implementing ranova
ANOVA-like table for random-effects: Single term deletions

Model:
RT ~ (1 | ID) + (1 | item_id)
              npar logLik   AIC     LRT Df Pr(>Chisq)    
<none>           4 -33979 67966                          
(1 | ID)         3 -39632 79271 11306.8  1  < 2.2e-16 ***
(1 | item_id)    3 -34229 68463   499.3  1  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Model 1
m_1 = lmer(RT~1 + WordType  + (1|ID) + (1|item_id), REML= FALSE, data=LexicalDec)
summary(m_1)
Linear mixed model fit by maximum likelihood . t-tests use
  Satterthwaite's method [lmerModLmerTest]
Formula: RT ~ 1 + WordType + (1 | ID) + (1 | item_id)
   Data: LexicalDec

     AIC      BIC   logLik deviance df.resid 
 67960.6  67993.1 -33975.3  67950.6     4995 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.7225 -0.6691 -0.0032  0.6608  3.4429 

Random effects:
 Groups   Name        Variance Std.Dev.
 ID       (Intercept) 403722   635.39  
 item_id  (Intercept)   4684    68.44  
 Residual              40255   200.64  
Number of obs: 5000, groups:  ID, 100; item_id, 50

Fixed effects:
                     Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)           2852.38      65.12  109.33  43.802  < 2e-16 ***
WordTypenon-cognates    57.64      20.17   49.05   2.857  0.00626 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
WrdTypnn-cg -0.155
confint(m_1)
Computing profile confidence intervals ...
                          2.5 %     97.5 %
.sig01                556.38816  734.94310
.sig02                 55.73392   85.83507
.sigma                196.70861  204.69519
(Intercept)          2723.61925 2981.15034
WordTypenon-cognates   17.31108   97.96025
summ(m_1)
MODEL INFO:
Observations: 5000
Dependent Variable: RT
Type: Mixed effects linear regression 

MODEL FIT:
AIC = 67960.56, BIC = 67993.15
Pseudo-R² (fixed effects) = 0.00
Pseudo-R² (total) = 0.91 

FIXED EFFECTS:
---------------------------------------------------------------------
                                Est.    S.E.   t val.     d.f.      p
-------------------------- --------- ------- -------- -------- ------
(Intercept)                  2852.38   65.12    43.80   109.33   0.00
WordTypenon-cognates           57.64   20.17     2.86    49.05   0.01
---------------------------------------------------------------------

p values calculated using Satterthwaite d.f.

RANDOM EFFECTS:
------------------------------------
  Group      Parameter    Std. Dev. 
---------- ------------- -----------
    ID      (Intercept)    635.39   
 item_id    (Intercept)     68.44   
 Residual                  200.64   
------------------------------------

Grouping variables:
---------------------------
  Group    # groups   ICC  
--------- ---------- ------
   ID        100      0.90 
 item_id      50      0.01 
---------------------------
ranova(m_1)
ANOVA-like table for random-effects: Single term deletions

Model:
RT ~ WordType + (1 | ID) + (1 | item_id)
              npar logLik   AIC     LRT Df Pr(>Chisq)    
<none>           5 -33975 67961                          
(1 | ID)         4 -39629 79265 11306.6  1  < 2.2e-16 ***
(1 | item_id)    4 -34183 68374   415.3  1  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#ICC
ICC(outcome = "RT", group = c("ID", "item_id"), data = LexicalDec)
Using formula(x) is deprecated when x is a character vector of length > 1.
  Consider formula(paste(x, collapse = " ")) instead.
[1] 0.8990089
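The participant ICC reported above can be reproduced by hand from the null model's variance components; a minimal sketch using lme4's VarCorr():

# Sketch: each grouping variance over the total variance (cf. summary(m_0))
vc <- as.data.frame(VarCorr(m_0))
setNames(round(vc$vcov / sum(vc$vcov), 3), vc$grp)  # ID ≈ 0.90, item_id ≈ 0.01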
# Model 2
m_2 = lmer(RT~1 + WordType + ReadingTime + (1|ID) + (1|item_id),REML= FALSE, data=LexicalDec)
summary(m_2)
Linear mixed model fit by maximum likelihood . t-tests use
  Satterthwaite's method [lmerModLmerTest]
Formula: RT ~ 1 + WordType + ReadingTime + (1 | ID) + (1 | item_id)
   Data: LexicalDec

     AIC      BIC   logLik deviance df.resid 
 67958.3  67997.4 -33973.2  67946.3     4994 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.7231 -0.6698 -0.0031  0.6610  3.4434 

Random effects:
 Groups   Name        Variance Std.Dev.
 ID       (Intercept) 386888   622.00  
 item_id  (Intercept)   4684    68.44  
 Residual              40255   200.64  
Number of obs: 5000, groups:  ID, 100; item_id, 50

Fixed effects:
                     Estimate Std. Error       df t value Pr(>|t|)
(Intercept)          2175.653    330.970  100.356   6.574 2.23e-09
WordTypenon-cognates   57.636     20.172   49.050   2.857  0.00625
ReadingTime             6.765      3.246   99.999   2.084  0.03973
                        
(Intercept)          ***
WordTypenon-cognates ** 
ReadingTime          *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) WrdTy-
WrdTypnn-cg -0.030       
ReadingTime -0.981  0.000
confint(m_2)
Computing profile confidence intervals ...
                            2.5 %     97.5 %
.sig01                544.6573875  719.46486
.sig02                 55.7334700   85.83360
.sigma                196.7086156  204.69519
(Intercept)          1520.7055704 2830.59964
WordTypenon-cognates   17.3115821   97.95975
ReadingTime             0.3404082   13.18931
summ(m_2)
MODEL INFO:
Observations: 5000
Dependent Variable: RT
Type: Mixed effects linear regression 

MODEL FIT:
AIC = 67958.31, BIC = 67997.41
Pseudo-R² (fixed effects) = 0.04
Pseudo-R² (total) = 0.91 

FIXED EFFECTS:
----------------------------------------------------------------------
                                Est.     S.E.   t val.     d.f.      p
-------------------------- --------- -------- -------- -------- ------
(Intercept)                  2175.65   330.97     6.57   100.36   0.00
WordTypenon-cognates           57.64    20.17     2.86    49.05   0.01
ReadingTime                     6.76     3.25     2.08   100.00   0.04
----------------------------------------------------------------------

p values calculated using Satterthwaite d.f.

RANDOM EFFECTS:
------------------------------------
  Group      Parameter    Std. Dev. 
---------- ------------- -----------
    ID      (Intercept)    622.00   
 item_id    (Intercept)     68.44   
 Residual                  200.64   
------------------------------------

Grouping variables:
---------------------------
  Group    # groups   ICC  
--------- ---------- ------
   ID        100      0.90 
 item_id      50      0.01 
---------------------------
ranova(m_2)
ANOVA-like table for random-effects: Single term deletions

Model:
RT ~ WordType + ReadingTime + (1 | ID) + (1 | item_id)
              npar logLik   AIC     LRT Df Pr(>Chisq)    
<none>           6 -33973 67958                          
(1 | ID)         5 -39533 79076 11119.3  1  < 2.2e-16 ***
(1 | item_id)    5 -34181 68372   415.3  1  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
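A side note on m_2: the intercept–ReadingTime correlation of −0.981 in the output above signals that the intercept is an extrapolation to ReadingTime = 0, far outside the observed range. Centring the predictor removes that collinearity without changing the slope; a minimal sketch (ReadingTime_c is an illustrative new column, not part of the original data):

# Sketch: centre ReadingTime so the intercept refers to an average reader
LexicalDec$ReadingTime_c <- LexicalDec$ReadingTime - mean(LexicalDec$ReadingTime)
m_2c = lmer(RT ~ 1 + WordType + ReadingTime_c + (1 | ID) + (1 | item_id),
            REML = FALSE, data = LexicalDec)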
# Model 3
m_3 = lmer(RT~1 + WordType + ReadingTime + ToddlerVocab + (1|ID) + (1|item_id),REML= FALSE, data=LexicalDec)
summary(m_3)
Linear mixed model fit by maximum likelihood . t-tests use
  Satterthwaite's method [lmerModLmerTest]
Formula: 
RT ~ 1 + WordType + ReadingTime + ToddlerVocab + (1 | ID) + (1 |  
    item_id)
   Data: LexicalDec

     AIC      BIC   logLik deviance df.resid 
 67606.5  67652.1 -33796.2  67592.5     4993 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.6367 -0.6657  0.0035  0.6595  3.5166 

Random effects:
 Groups   Name        Variance Std.Dev.
 ID       (Intercept) 10447    102.21  
 item_id  (Intercept)  4636     68.09  
 Residual             40255    200.64  
Number of obs: 5000, groups:  ID, 100; item_id, 50

Fixed effects:
                      Estimate Std. Error        df t value Pr(>|t|)
(Intercept)          3953.7210    65.6668  108.8113  60.209   <2e-16
WordTypenon-cognates   57.6357    20.0774   49.7399   2.871    0.006
ReadingTime            14.3486     0.5684   99.7945  25.244   <2e-16
ToddlerVocab          -38.6674     0.6686   99.7945 -57.838   <2e-16
                        
(Intercept)          ***
WordTypenon-cognates ** 
ReadingTime          ***
ToddlerVocab         ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) WrdTy- RdngTm
WrdTypnn-cg -0.153              
ReadingTime -0.712  0.000       
ToddlerVocb -0.468  0.000 -0.231
confint(m_3)
Computing profile confidence intervals ...
                          2.5 %     97.5 %
.sig01                 88.46066  119.35554
.sig02                 55.50986   85.25948
.sigma                196.70934  204.69601
(Intercept)          3823.87012 4083.57184
WordTypenon-cognates   17.51276   97.75857
ReadingTime            13.22373   15.47341
ToddlerVocab          -39.99049  -37.34439
# Model 4
m_4 = lmer(RT~1 + WordType + ReadingTime + ToddlerVocab + (1|ID) + (1|item_id) + MotherVocab,REML= FALSE, data=LexicalDec)
summary(m_4)
Linear mixed model fit by maximum likelihood . t-tests use
  Satterthwaite's method [lmerModLmerTest]
Formula: 
RT ~ 1 + WordType + ReadingTime + ToddlerVocab + (1 | ID) + (1 |  
    item_id) + MotherVocab
   Data: LexicalDec

     AIC      BIC   logLik deviance df.resid 
 67607.0  67659.1 -33795.5  67591.0     4992 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.6327 -0.6654  0.0029  0.6605  3.5196 

Random effects:
 Groups   Name        Variance Std.Dev.
 ID       (Intercept) 10279    101.38  
 item_id  (Intercept)  4636     68.09  
 Residual             40255    200.64  
Number of obs: 5000, groups:  ID, 100; item_id, 50

Fixed effects:
                      Estimate Std. Error        df t value Pr(>|t|)
(Intercept)          3910.1041    74.2082  106.8387  52.691   <2e-16
WordTypenon-cognates   57.6357    20.0766   49.7442   2.871    0.006
ReadingTime            14.1666     0.5832   99.7910  24.292   <2e-16
ToddlerVocab          -38.9630     0.7057   99.7910 -55.213   <2e-16
MotherVocab             0.8039     0.6533   99.7910   1.231    0.221
                        
(Intercept)          ***
WordTypenon-cognates ** 
ReadingTime          ***
ToddlerVocab         ***
MotherVocab             
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) WrdTy- RdngTm TddlrV
WrdTypnn-cg -0.135                     
ReadingTime -0.484  0.000              
ToddlerVocb -0.224  0.000 -0.124       
MotherVocab -0.478  0.000 -0.254 -0.340
anova(m_3, m_4)
Data: LexicalDec
Models:
m_3: RT ~ 1 + WordType + ReadingTime + ToddlerVocab + (1 | ID) + (1 | 
m_3:     item_id)
m_4: RT ~ 1 + WordType + ReadingTime + ToddlerVocab + (1 | ID) + (1 | 
m_4:     item_id) + MotherVocab
    npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)
m_3    7 67606 67652 -33796    67592                     
m_4    8 67607 67659 -33795    67591 1.5029  1     0.2202
summ(m_4)
MODEL INFO:
Observations: 5000
Dependent Variable: RT
Type: Mixed effects linear regression 

MODEL FIT:
AIC = 67606.99, BIC = 67659.13
Pseudo-R² (fixed effects) = 0.88
Pseudo-R² (total) = 0.91 

FIXED EFFECTS:
---------------------------------------------------------------------
                                Est.    S.E.   t val.     d.f.      p
-------------------------- --------- ------- -------- -------- ------
(Intercept)                  3910.10   74.21    52.69   106.84   0.00
WordTypenon-cognates           57.64   20.08     2.87    49.74   0.01
ReadingTime                    14.17    0.58    24.29    99.79   0.00
ToddlerVocab                  -38.96    0.71   -55.21    99.79   0.00
MotherVocab                     0.80    0.65     1.23    99.79   0.22
---------------------------------------------------------------------

p values calculated using Satterthwaite d.f.

RANDOM EFFECTS:
------------------------------------
  Group      Parameter    Std. Dev. 
---------- ------------- -----------
    ID      (Intercept)    101.38   
 item_id    (Intercept)     68.09   
 Residual                  200.64   
------------------------------------

Grouping variables:
---------------------------
  Group    # groups   ICC  
--------- ---------- ------
   ID        100      0.19 
 item_id      50      0.08 
---------------------------
ranova(m_4)
ANOVA-like table for random-effects: Single term deletions

Model:
RT ~ WordType + ReadingTime + ToddlerVocab + MotherVocab + (1 | ID) + (1 | item_id)
              npar logLik   AIC    LRT Df Pr(>Chisq)    
<none>           8 -33795 67607                         
(1 | ID)         7 -34231 68476 870.78  1  < 2.2e-16 ***
(1 | item_id)    7 -34003 68020 414.72  1  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Model 5
m_5 = lmer(RT~1 + WordType + ReadingTime + ToddlerVocab + (1|ID) + (1|item_id) + MotherVocab + FatherVocab ,REML= FALSE, data=LexicalDec)
summary(m_5)
Linear mixed model fit by maximum likelihood . t-tests use
  Satterthwaite's method [lmerModLmerTest]
Formula: 
RT ~ 1 + WordType + ReadingTime + ToddlerVocab + (1 | ID) + (1 |  
    item_id) + MotherVocab + FatherVocab
   Data: LexicalDec

     AIC      BIC   logLik deviance df.resid 
 67607.4  67666.1 -33794.7  67589.4     4991 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.6320 -0.6665  0.0038  0.6583  3.5294 

Random effects:
 Groups   Name        Variance Std.Dev.
 ID       (Intercept) 10104    100.52  
 item_id  (Intercept)  4634     68.07  
 Residual             40255    200.64  
Number of obs: 5000, groups:  ID, 100; item_id, 50

Fixed effects:
                      Estimate Std. Error        df t value Pr(>|t|)
(Intercept)          3956.4526    82.2228  105.5126  48.119  < 2e-16
WordTypenon-cognates   57.6357    20.0733   49.7659   2.871  0.00599
ReadingTime            14.2745     0.5848   99.7827  24.410  < 2e-16
ToddlerVocab          -38.9139     0.7012   99.7827 -55.499  < 2e-16
MotherVocab             0.9275     0.6554   99.7827   1.415  0.16012
FatherVocab            -0.7166     0.5655   99.7827  -1.267  0.20803
                        
(Intercept)          ***
WordTypenon-cognates ** 
ReadingTime          ***
ToddlerVocab         ***
MotherVocab             
FatherVocab             
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) WrdTy- RdngTm TddlrV MthrVc
WrdTypnn-cg -0.122                            
ReadingTime -0.364  0.000                     
ToddlerVocb -0.176  0.000 -0.114              
MotherVocab -0.357  0.000 -0.226 -0.328       
FatherVocab -0.445  0.000 -0.146 -0.055 -0.149
summ(m_5)
MODEL INFO:
Observations: 5000
Dependent Variable: RT
Type: Mixed effects linear regression 

MODEL FIT:
AIC = 67607.40, BIC = 67666.05
Pseudo-R² (fixed effects) = 0.88
Pseudo-R² (total) = 0.91 

FIXED EFFECTS:
---------------------------------------------------------------------
                                Est.    S.E.   t val.     d.f.      p
-------------------------- --------- ------- -------- -------- ------
(Intercept)                  3956.45   82.22    48.12   105.51   0.00
WordTypenon-cognates           57.64   20.07     2.87    49.77   0.01
ReadingTime                    14.27    0.58    24.41    99.78   0.00
ToddlerVocab                  -38.91    0.70   -55.50    99.78   0.00
MotherVocab                     0.93    0.66     1.42    99.78   0.16
FatherVocab                    -0.72    0.57    -1.27    99.78   0.21
---------------------------------------------------------------------

p values calculated using Satterthwaite d.f.

RANDOM EFFECTS:
------------------------------------
  Group      Parameter    Std. Dev. 
---------- ------------- -----------
    ID      (Intercept)    100.52   
 item_id    (Intercept)     68.07   
 Residual                  200.64   
------------------------------------

Grouping variables:
---------------------------
  Group    # groups   ICC  
--------- ---------- ------
   ID        100      0.18 
 item_id      50      0.08 
---------------------------
ranova(m_5)
ANOVA-like table for random-effects: Single term deletions

Model:
RT ~ WordType + ReadingTime + ToddlerVocab + MotherVocab + FatherVocab + (1 | ID) + (1 | item_id)
              npar logLik   AIC    LRT Df Pr(>Chisq)    
<none>           9 -33795 67607                         
(1 | ID)         8 -34222 68460 855.03  1  < 2.2e-16 ***
(1 | item_id)    8 -34002 68020 414.72  1  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
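Before committing to a final model, the nested candidates can also be compared in a single call; anova() on several merMod fits reports AIC, BIC and likelihood-ratio tests for the whole sequence (a minimal sketch, relying on all models having been fitted with REML = FALSE):

# Sketch: joint likelihood-ratio comparison of the nested candidates
anova(m_0, m_1, m_2, m_3, m_4, m_5)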
# Selecting m_3 as final model
library("ggplot2")
plot_model(m_3, type='re', sort.est = 'sort.all', grid = FALSE)
[plot_model output: caterpillar plots of the ID ([[1]]) and item_id ([[2]]) random intercepts, rendered in the notebook]

# checking summary for model 3
summ(m_3)
MODEL INFO:
Observations: 5000
Dependent Variable: RT
Type: Mixed effects linear regression 

MODEL FIT:
AIC = 67606.49, BIC = 67652.11
Pseudo-R² (fixed effects) = 0.88
Pseudo-R² (total) = 0.91 

FIXED EFFECTS:
---------------------------------------------------------------------
                                Est.    S.E.   t val.     d.f.      p
-------------------------- --------- ------- -------- -------- ------
(Intercept)                  3953.72   65.67    60.21   108.81   0.00
WordTypenon-cognates           57.64   20.08     2.87    49.74   0.01
ReadingTime                    14.35    0.57    25.24    99.79   0.00
ToddlerVocab                  -38.67    0.67   -57.84    99.79   0.00
---------------------------------------------------------------------

p values calculated using Satterthwaite d.f.

RANDOM EFFECTS:
------------------------------------
  Group      Parameter    Std. Dev. 
---------- ------------- -----------
    ID      (Intercept)    102.21   
 item_id    (Intercept)     68.09   
 Residual                  200.64   
------------------------------------

Grouping variables:
---------------------------
  Group    # groups   ICC  
--------- ---------- ------
   ID        100      0.19 
 item_id      50      0.08 
---------------------------
ranova(m_3)
ANOVA-like table for random-effects: Single term deletions

Model:
RT ~ WordType + ReadingTime + ToddlerVocab + (1 | ID) + (1 | item_id)
              npar logLik   AIC    LRT Df Pr(>Chisq)    
<none>           7 -33796 67606                         
(1 | ID)         6 -34239 68490 885.84  1  < 2.2e-16 ***
(1 | item_id)    6 -34004 68019 414.73  1  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#checking coeff
coef(summary(m_3))
                       Estimate Std. Error        df    t value
(Intercept)          3953.72098 65.6668271 108.81135  60.208802
WordTypenon-cognates   57.63567 20.0774070  49.73991   2.870673
ReadingTime            14.34857  0.5683937  99.79450  25.244069
ToddlerVocab          -38.66744  0.6685519  99.79450 -57.837603
                         Pr(>|t|)
(Intercept)          2.236411e-85
WordTypenon-cognates 6.001178e-03
ReadingTime          3.996799e-45
ToddlerVocab         1.450679e-78
VarCorr(m_3)
 Groups   Name        Std.Dev.
 ID       (Intercept) 102.21  
 item_id  (Intercept)  68.09  
 Residual             200.64  
# inspecting the structure of the residual vector before plotting
str(resid(m_3))
 Named num [1:5000] 372.1 -28.3 -131.8 -118.2 383.6 ...
 - attr(*, "names")= chr [1:5000] "1" "2" "3" "4" ...
plot(fitted(m_3), resid(m_3, type = "pearson"))# this will create the plot
abline(0,0, col="red")

qqnorm(resid(m_3)) 
qqline(resid(m_3), col = "red") # reference line through the quartiles

# lattice plot
require(lattice)
dotplot(ranef(m_3, condVar=TRUE))
[dotplot output: $ID and $item_id panels showing the random intercepts with conditional variances, rendered in the notebook]

print(dotplot(ranef(m_3,condVar=TRUE))$ID)

# Checking random effect
ranef(m_3)
$ID
    (Intercept)
1     87.328278
2   -208.904903
3   -161.237560
...             [random intercepts for participants 4–100 omitted]

$item_id
   (Intercept)
1     3.610568
2   123.707088
3    16.738904
...             [random intercepts for items 4–50 omitted]

with conditional variances for “ID” “item_id” 
#install.packages("MuMIn")
require(MuMIn)
r.squaredGLMM(m_3)
           R2m       R2c
[1,] 0.8768848 0.9104415
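These values can be reproduced from first principles under the Nakagawa–Schielzeth definitions that MuMIn implements; a minimal sketch:

# Sketch: marginal and conditional R^2 from m_3's variance components
var_fix <- var(as.vector(model.matrix(m_3) %*% fixef(m_3)))  # variance of fixed-effect predictions
vc      <- as.data.frame(VarCorr(m_3))
var_ran <- sum(vc$vcov[vc$grp != "Residual"])                # ID + item_id intercept variances
var_res <- vc$vcov[vc$grp == "Residual"]
c(R2m = var_fix / (var_fix + var_ran + var_res),             # ≈ 0.88
  R2c = (var_fix + var_ran) / (var_fix + var_ran + var_res)) # ≈ 0.91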

I used a mixed-effects model with partial pooling, as the data are multilevel and the ICC gave substantial evidence of clustering. The final model has three fixed-effect predictors (WordType, ReadingTime and ToddlerVocab) and two crossed random intercepts (participant ID and item_id). Because the spaghetti plot showed large between-participant spread, I expected participant ID to be the main source of random variation, and the null model confirmed this (participant ICC ≈ 0.90); the random intercept for item_id was retained as well, since ranova showed that dropping it significantly worsened fit. Through an iterative model-building process I settled on a model that controls for ReadingTime and ToddlerVocab when estimating the effect of WordType; adding MotherVocab or FatherVocab (m_4, m_5) did not improve fit.
I evaluated the models against the following hypotheses:
H0: cognates and non-cognates do not differ in mean reaction time.
Ha: cognates and non-cognates differ in mean reaction time.
Under the final model, non-cognates were on average 57.6 ms slower than cognates (95% CI [17.5, 97.8], p = .006), supporting faster processing of cognates.
Formula:
mult1 = lmer(RT~WordType + ReadingTime + ToddlerVocab + (1|ID) + (1|item_id), data=LexicalDec)
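Note that mult1 above omits REML = FALSE, so lme4 fits it by REML (its default), which is the usual choice for reporting variance components once the fixed-effects structure is settled; the ML fits above were needed for the AIC and likelihood-ratio comparisons. A minimal reporting sketch:

# Sketch: refit the final specification with REML and extract the cognate effect
mult1 = lmer(RT ~ WordType + ReadingTime + ToddlerVocab +
               (1 | ID) + (1 | item_id), data = LexicalDec)
summary(mult1)
confint(mult1, parm = "WordTypenon-cognates")  # profile CI for the WordType contrast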

