Introduction

This report uses the MacArthur-Bates Communicative Development Inventories (CDIs), a family of parent-report instruments on early language acquisition made available through Wordbank, to explore the relationship between language comprehension and production and gesture development in infants. The first section describes the characteristics of the sample used in the analysis. The second section explores the relationship between comprehension and production of language and gesture development using linear regression and beta regression models. Finally, the third section presents the results of a hierarchical logistic regression model used to investigate whether there is a relationship between children's demographic characteristics and the likelihood of producing gestures between 8 and 18 months of age.

Methods

a. Participants

For the current analysis, we focus on \(1,149\) native-English infants (\(49.9\%\) female and \(72.5\%\) White, as percentages of valid responses) ranging from 8 to 18 months of age (Mean = 13, SD = 2.8) (Table 1).

Table 1. Descriptive statistics of demographic information
Variable	Stats / Values	Freqs (% of Valid)	Missing
age [integer]	Mean (sd): 13 (2.8); min ≤ med ≤ max: 8 ≤ 13 ≤ 18; Q1 - Q3: 11 - 15	11 distinct values	0 (0.0%)
sex [factor]	1. Female	552 (49.9%)	43 (3.7%)
	2. Male	554 (50.1%)	
	3. Other	0 (0.0%)	
birth_order [factor]	1. First	525 (50.1%)	102 (8.9%)
	2. Second	337 (32.2%)	
	3. Third	134 (12.8%)	
	4. Fourth	37 (3.5%)	
	5. Fifth	6 (0.6%)	
	6. Sixth	7 (0.7%)	
	7. Seventh	1 (0.1%)	
	8. Eighth	0 (0.0%)	
ethnicity [factor]	1. Asian	46 (4.3%)	82 (7.1%)
	2. Black	124 (11.6%)	
	3. Other	62 (5.8%)	
	4. White	774 (72.5%)	
	5. Hispanic	61 (5.7%)	
mom_ed [factor]	1. None	0 (0.0%)	81 (7.0%)
	2. Primary	2 (0.2%)	
	3. Some Secondary	69 (6.5%)	
	4. Secondary	254 (23.8%)	
	5. Some College	267 (25.0%)	
	6. College	294 (27.5%)	
	7. Some Graduate	41 (3.8%)	
	8. Graduate	141 (13.2%)	

Generated by summarytools 1.0.0 (R version 4.1.2), 2021-11-12

Measures

a. Language acquisition

The data set includes total comprehension and production scores. These are based on parent reports for a representative sample of words from many semantic categories (e.g., animal names, household items) and syntactic categories (e.g., action words, connectives), and record the words that the child currently understands and says, respectively.

b. Gestures development

The data set includes five sub-scales on gesture development: first gestures (\(12\) items), adult gestures (\(15\) items), parent gestures (\(13\) items), object gestures (\(17\) items) and games gestures (\(6\) items). Scores for each sub-scale were estimated using a unidimensional multiple-group IRT model based on the two-parameter logistic model (2PLM) for the binary item responses and the Graded Response Model (GRM) for the polytomous item responses. The 2PLM is a generalization of the Rasch model, which assumes that the probability of a positive response to item i depends only on the difference between child v's trait level \(\theta_{v}\) and the difficulty of the item \(b_{i}\). In addition, the 2PLM postulates that, for every item, the association between this difference and the response probability depends on an additional item discrimination parameter \(a_{i}\). The discrimination parameter describes how strongly an item relates to the latent trait and, therefore, how well it discriminates between children with different trait levels compared to the other items in the sub-scale:

\[ Pr(x_{vi} = 1 | \theta_{v}, b_{i}, a_{i}) = \frac{\exp(Da_{i}(\theta_{v}-b_{i}))}{1 + \exp(Da_{i}(\theta_{v}-b_{i}))} \]

where \(D\) is a scaling constant. The GRM, like the 2PLM, is a mathematical model for the probability that an individual will respond in a given response category (k) on a particular item; it is appropriate for the polytomous and ordinal nature of these items. The GRM is specified as follows:

\[ Pr(x_{vi} \geq k | \theta_{v}, b_{ik}, a_{i}) = \frac{1}{1 + \exp(-a_{i}(\theta_{v}-b_{ik}))} \]

Here, \(\theta_{v}\) represents the latent score for child v, \(a_{i}\) represents the discrimination parameter for item i, and \(b_{ik}\) indicates the location parameter for item i and score category k. The discrimination parameter (\(a_{i}\)) indicates how well an item can distinguish between children with very similar latent abilities. The location parameter (\(b_{ik}\)) indicates how high a level of perceived gesture ability, \(\theta_{v}\), a child needs in order to respond at or above category k. Before estimating scores, item fit was evaluated by checking that the difficulty (location) parameters fell between \(-3.0\) and \(3.0\) and that the discrimination parameters fell between \(0.5\) and \(5.0\).
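As a minimal sketch of the two response functions above, the following Python code evaluates the 2PLM probability and derives GRM category probabilities by differencing adjacent cumulative probabilities. The function names, the default value D = 1.0 and the item parameters are assumptions made for illustration only; they are not the values estimated in this report.

```python
import math


def p_2pl(theta, a, b, D=1.0):
    """2PLM: probability of a positive response to a binary item."""
    z = D * a * (theta - b)
    # 1 / (1 + exp(-z)) is algebraically equal to exp(z) / (1 + exp(z))
    return 1.0 / (1.0 + math.exp(-z))


def grm_category_probs(theta, a, thresholds):
    """GRM: probabilities of categories 0..K for one ordinal item.

    thresholds holds the ordered location parameters b_i1 < ... < b_iK.
    """
    # Cumulative probabilities Pr(x >= k) for k = 1..K
    cumulative = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    # Pad with Pr(x >= 0) = 1 and Pr(x >= K + 1) = 0, then take differences
    bounds = [1.0] + cumulative + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]


# Hypothetical item parameters, chosen only for illustration
print(p_2pl(theta=0.0, a=1.2, b=0.0))
print(grm_category_probs(theta=0.5, a=1.5, thresholds=[-1.0, 0.0, 1.0]))
```

When the trait level equals the item difficulty, the 2PLM probability is 0.5 regardless of the discrimination, and the GRM category probabilities always sum to one.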

The sub-scale scores can be interpreted on a standard normal scale, where −1 and +1 are one standard deviation below and above the mean, respectively. Table 2 shows Cronbach's \(\alpha\) coefficient, a measure of internal consistency, for each gesture sub-scale. Except for the games gestures sub-scale (\(\alpha = 0.593\)), all gesture sub-scales showed acceptable internal consistency.

Furthermore, Figures 1 and 2 present the item characteristic curves for each item of the first gestures and object gestures sub-scales, which were used in the subsequent analysis.

Table 2. Internal Consistency for Gestures Sub-scales
Sub-scale	No. Items	Cronbach's α
Gestures adult	15	0.885
First Gestures	12	0.852
Gestures games	6	0.593
Gestures objects	17	0.891
Gestures parent	13	0.885
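For readers unfamiliar with the coefficient in Table 2, the sketch below computes Cronbach's α from a small matrix of item responses using the usual variance-based formula. The `cronbach_alpha` helper and the response data are hypothetical and serve only to illustrate the calculation; the report does not state which software routine produced the values in Table 2.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per child, all of equal length."""
    k = len(rows[0])
    items = list(zip(*rows))  # transpose: one tuple of scores per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical responses of five children to four binary gesture items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 3))
```

The coefficient grows when the items vary together (large variance of the total score relative to the sum of the item variances), which is why a short and heterogeneous sub-scale can yield a lower value.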

Figure 1 Item Characteristic curves for First Gestures Sub-scale, 2PL Model

Figure 2 Item Characteristic curves for Gestures Objects Sub-scale, 2PL Model

Results

a. Descriptive information

Table 3 shows descriptive information for the language acquisition scores (comprehension and production) and the gesture sub-scales. Production scores range from 0 to 376 (Mean = \(24.3\), SD = \(45.7\)), while comprehension scores range from 0 to 396 (Mean = \(114.9\), SD = \(94.5\)). Scores for the first gestures and object gestures sub-scales take values between \(-2.4\) and \(2.3\), with mean \(0\) and standard deviation approximately \(1\).

Table 3. Descriptive statistics of language acquisition and gesture development scales
Variable	Stats / Values	Freqs (% of Valid)	Missing
comprehension [integer]	Mean (sd): 114.9 (94.5); min ≤ med ≤ max: 0 ≤ 89 ≤ 396; Q1 - Q3: 38 - 170	310 distinct values	0 (0.0%)
production [integer]	Mean (sd): 24.3 (45.7); min ≤ med ≤ max: 0 ≤ 7 ≤ 376; Q1 - Q3: 2 - 24	145 distinct values	0 (0.0%)
gestures_adult [numeric]	Mean (sd): 0 (0.9); min ≤ med ≤ max: -1.5 ≤ 0 ≤ 2.3; Q1 - Q3: -0.7 - 0.7	575 distinct values	15 (1.3%)
gestures_first [numeric]	Mean (sd): 0 (0.9); min ≤ med ≤ max: -2.4 ≤ 0 ≤ 2.3; Q1 - Q3: -0.6 - 0.6	881 distinct values	4 (0.3%)
gestures_games [numeric]	Mean (sd): 0 (0.8); min ≤ med ≤ max: -2 ≤ 0.2 ≤ 1; Q1 - Q3: -0.4 - 0.6	60 distinct values	11 (1.0%)
gestures_objects [numeric]	Mean (sd): 0 (0.9); min ≤ med ≤ max: -2.2 ≤ 0 ≤ 2.1; Q1 - Q3: -0.7 - 0.7	634 distinct values	11 (1.0%)
gestures_parent [numeric]	Mean (sd): 0 (0.9); min ≤ med ≤ max: -1.1 ≤ 0 ≤ 2.3; Q1 - Q3: -1.1 - 0.7	350 distinct values	15 (1.3%)


Figure 3 displays the distribution of child age, the language acquisition scales, and the gesture sub-scales used in the subsequent analysis. The distribution of language production scores is severely positively skewed (right-skewed); that is, most values are clustered at the low end of the scale, with a long tail to the right. Indeed, its skewness and kurtosis were \(3.54\) and \(15.20\), respectively. According to George & Mallery (2010), skewness and kurtosis values between \(-2\) and \(+2\) are considered acceptable evidence of a normal univariate distribution, which is a strong assumption for linear regression analysis. In contrast, the comprehension scores have skewness and kurtosis within this range (\(0.94\) and \(0.084\), respectively).
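The following sketch shows one common way to compute the two shape statistics discussed above, using central moments. It is an illustration under stated assumptions: the function name and the sample data are hypothetical, kurtosis is expressed as excess kurtosis (zero for a normal distribution), and the software used for this report may apply bias-corrected formulas that give slightly different values.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Hypothetical right-skewed scores: many small values and a few large ones
scores = [0, 0, 1, 2, 2, 3, 7, 24, 60]
skew, kurt = skewness_and_excess_kurtosis(scores)
print(round(skew, 2), round(kurt, 2))
```

A positive skewness, as for these made-up scores, reflects a long right tail, which is the pattern described for the production scores.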

Figure 3 Distribution of variables

Finally, the correlation plot below provides a visual representation of the bivariate relationships between the variables included in the analysis. The scatterplots include a fitted linear regression (red line) and a local polynomial regression (green line); the latter is a nonparametric method that relaxes the linearity assumption of conventional regression methods. The results suggest that the relationship between production and gestures could be non-linear. Furthermore, the variables included in the analysis were positively and moderately to fairly strongly correlated, with Pearson correlation coefficients ranging between \(0.493\) and \(0.773\).

Figure 3 Scatterplot and Correlation between variables

b. Linear Regression Models

Table 4 presents the results of the linear regression models, which examine the degree to which age, first gestures, and object gestures predict language comprehension and production.

Across the regressions, age, first gestures and object gestures are positive and significant predictors of the language acquisition indicators. On average, holding the other predictors constant, a one-month increase in child age is associated with an increase of \(5.73\) points in comprehension and \(3.38\) points in production, while a one-point increase (approximately one standard deviation) in the object gestures score is associated with increases of \(42.77\) and \(11.70\) points in comprehension and production, respectively. Furthermore, the explanatory variables explained \(58.1\%\) and \(32.4\%\) of the variance in the comprehension and production scales, respectively.

Table 4. Linear Regression Models results for Comprehension and Production
	Comprehension			Production
Predictors	Estimates	std.Error	p-value	Estimates	std.Error	p-value
Intercept	40.62 ^**	13.67	0.003	-19.62 ^*	8.41	0.020
Age (Months)	5.73 ^***	1.04	<0.001	3.38 ^***	0.64	<0.001
First Gestures	23.92 ^***	3.17	<0.001	8.93 ^***	1.95	<0.001
Gestures Objects	42.77 ^***	3.60	<0.001	11.70 ^***	2.22	<0.001
Observations	1137			1137
R² / R² adjusted	0.581 / 0.579			0.324 / 0.322
* p<0.05   ** p<0.01   *** p<0.001
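To make the coefficients in Table 4 concrete, the sketch below plugs them into the fitted linear equations. The `predict` helper and the chosen child profiles are illustrative assumptions; only the coefficient values come from Table 4.

```python
# Coefficients copied from Table 4: (intercept, age, first gestures, object gestures)
COMPREHENSION = (40.62, 5.73, 23.92, 42.77)
PRODUCTION = (-19.62, 3.38, 8.93, 11.70)


def predict(coefs, age_months, first_gestures, object_gestures):
    """Predicted score from a fitted linear model with three predictors."""
    b0, b_age, b_first, b_objects = coefs
    return (b0 + b_age * age_months
            + b_first * first_gestures
            + b_objects * object_gestures)


# A 13-month-old with average gesture scores (both sub-scales have mean 0)
print(round(predict(COMPREHENSION, 13, 0.0, 0.0), 2))  # 115.11
print(round(predict(PRODUCTION, 13, 0.0, 0.0), 2))     # 24.32

# An 8-month-old one standard deviation below average on both sub-scales
print(round(predict(PRODUCTION, 8, -1.0, -1.0), 2))    # -13.21
```

The first two predictions are close to the sample means in Table 3 (114.9 and 24.3), as expected when the predictors are set near their averages. The third prediction is negative, which is impossible for a word count and is one way to see why a linear model can fit the production scores poorly.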

Linear regression makes several assumptions about the data: linearity (the relationship between the predictors and the outcome is assumed to be linear), normality of residuals (residual errors are assumed to be normally distributed), homoscedasticity (constant variance of the residuals), and independence of the residual errors. The figures below check whether these assumptions hold in the regression models estimated above.

The linearity assumption can be checked by inspecting the Residuals vs Fitted plot, where a horizontal line without distinct patterns indicates a linear relationship. In both regression models there is no clear pattern in the residual plot, suggesting that we can assume a linear relationship between the predictors and the outcome variables.

The normality of the residuals can be verified using the Normal Q-Q plot, in which the residuals should approximately follow a straight line. In the comprehension model, almost all points fall approximately along this reference line, so we can assume normality. In the production model, however, the residuals do not follow a normal distribution. The assumption of homogeneity of residual variance is checked using the Scale-Location (or Spread-Location) plot: if the residuals are homoscedastic, they are spread equally along the range of fitted values. The residuals of the production model are heteroscedastic, because the plot shows that the variability of the residuals increases with the fitted value of the outcome variable, suggesting non-constant variance in the residual errors. Finally, the Residuals vs Leverage plot identified extreme values that influence the estimation results. In sum, the results suggest that linear regression is not the most appropriate method for predicting the language acquisition scales, particularly language production.

Figure 4 Linear Regression Diagnostic plots. Results for language comprehension

Figure 5 Linear Regression Diagnostic plots. Results for language production

c. Beta regression with variable dispersion

Beta regression models offer an alternative approach for data that exhibit features such as heteroskedasticity or skewness. The beta regression model, introduced by Ferrari and Cribari-Neto (2004), is useful for modeling continuous variables \(y\) that take values in the open standard unit interval \((0, 1)\). It is based on the assumption that the dependent variable is beta-distributed and that its mean is related to a set of regressors through a linear predictor with unknown coefficients and a link function. If the variable \(y\) also takes the extreme values 0 and 1, a useful transformation in practice, proposed by Smithson and Verkuilen (2006), is \(\frac{y(n-1)+0.5}{n}\), where \(n\) is the sample size.
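The transformation above can be sketched as a small function. Note the assumptions: the report does not state how raw scores were first rescaled to the closed interval [0, 1], so dividing by a maximum score of 396 is a hypothetical choice made here for illustration, as are the function name and the example scores.

```python
def squeeze_unit_interval(y, n):
    """Smithson-Verkuilen transformation: maps [0, 1] into the open interval (0, 1)."""
    if not 0.0 <= y <= 1.0:
        raise ValueError("y must already lie in [0, 1]")
    return (y * (n - 1) + 0.5) / n


n = 1137                      # observations reported in the regression tables
max_score = 396               # hypothetical rescaling constant (assumption)
raw_scores = [0, 7, 89, 396]  # hypothetical raw scores

for raw in raw_scores:
    print(raw, squeeze_unit_interval(raw / max_score, n))
```

The transformation pulls 0 and 1 slightly inside the interval while leaving 0.5 unchanged, and the adjustment shrinks as the sample size grows.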

Tables 5 and 6 compare the linear and beta regression models for comprehension and production, respectively, while Figures 6 and 7 show the diagnostic plots used to assess goodness of fit. The results indicate that beta regression is a better approach for modeling these variables, particularly language production, where the explained variance increases to \(55.3\%\) compared with the OLS model (adjusted \(R^2 = 32.2\%\)). In both models, the effects of age, first gestures and object gestures are positive and statistically significant.

Table 5. Comparison OLS vs Beta-Regression: Comprehension
	Linear Regression			Beta-Regression
Predictors	Estimates	std.Error	p-value	Estimates	std.Error	p-value
Intercept	40.62 ^**	13.67	0.003	0.18 ^***	0.03	<0.001
Age (Months)	5.73 ^***	1.04	<0.001	1.06 ^***	0.01	<0.001
First Gestures	23.92 ^***	3.17	<0.001	1.51 ^***	0.06	<0.001
Gestures Objects	42.77 ^***	3.60	<0.001	1.71 ^***	0.08	<0.001
Observations	1137			1137
R² / R² adjusted	0.581 / 0.579			0.558
AIC	12597.375			-1492.095
* p<0.05   ** p<0.01   *** p<0.001

Table 6. Comparison OLS vs Beta-Regression: Production
	Linear Regression			Beta-Regression
Predictors	Estimates	std.Error	p-value	Estimates	std.Error	p-value
Intercept	-19.62 ^*	8.41	0.020	0.02 ^***	0.00	<0.001
Age (Months)	3.38 ^***	0.64	<0.001	1.09 ^***	0.02	<0.001
First Gestures	8.93 ^***	1.95	<0.001	1.35 ^***	0.06	<0.001
Gestures Objects	11.70 ^***	2.22	<0.001	1.48 ^***	0.08	<0.001
Observations	1137			1137
R² / R² adjusted	0.324 / 0.322			0.553
AIC	11493.654			-5106.356
* p<0.05   ** p<0.01   *** p<0.001

Figure 6 Beta Regression Diagnostic plots. Results for language comprehension

Figure 7 Beta Regression Diagnostic plots. Results for language production

Research Question 2. Logistic Hierarchical Models

The gestures that children produce early in development are related to the progress they make in language acquisition. Furthermore, once language has been mastered, children's gestures facilitate their learning of other concepts. This section presents a hierarchical logistic regression model used to explore the relationship between children's sociodemographic characteristics and the likelihood of producing gestures between 8 and 18 months. In particular, the analysis takes into account that the gesture data have a hierarchical structure: gesture items are nested within children, that is, each child contributes repeated observations. Overall, the dataset includes \(65,415\) gesture item responses (level 1) observed in \(1,149\) children (group, or level 2). The following models were estimated:

Model 0: Logistic model without random effects. This model does not include any explanatory variable and assumes that gesture responses are independent. This assumption is inappropriate because the gesture items are repeated observations on each child.

\[ ln(\frac{p(Y=1)}{1-p(Y=1)})=\beta_{0}\]

Model 1: Logistic model with random intercept. This model does not include any explanatory variable but includes a random intercept effect (\(\tau_{00}\)), meaning that the probability of producing gestures (\(\beta_{0}\)) is expected to vary across children.

\[ ln(\frac{p(Y=1)}{1-p(Y=1)})=\beta_{0}+\sigma^{2}\]

\[ \beta_{0}= \gamma_{00} + \tau_{00}\]

Model 2: Logistic model with two random intercept effects. This model includes two random intercept effects, so that the probability of producing gestures (\(\beta_{0}\)) varies across children (\(\tau_{00}\)) and across types of gestures (\(\tau_{01}\)).

\[ ln(\frac{p(Y=1)}{1-p(Y=1)})=\beta_{0}+\sigma^{2}\]

\[\beta_{0}= \gamma_{00} + \tau_{00} +\tau_{01}\]

Model 3: Logistic model with two random intercepts and fixed effects. In addition to the random intercept effects, this model includes explanatory variables at level 2 (age in months; mother's education, 1 = college degree or above; and minority, 1 = non-White) to explain differences between children in the probability of producing gestures (\(\beta_{0}\)).

\[ ln(\frac{p(Y=1)}{1-p(Y=1)})=\beta_{0}+\sigma^{2}\]

\[ \beta_{0}= \gamma_{00} + \gamma_{01}\,\text{age} + \gamma_{02}\,\text{momedu} + \gamma_{03}\,\text{minority} + \tau_{00} +\tau_{01} \]

Table 10 summarizes the model estimates. Coefficients are presented as odds ratios (OR), a measure of the strength of association between an explanatory variable and the outcome. An OR \(>1\) means that higher values of the explanatory variable are associated with higher odds of producing a gesture, while an OR \(< 1\) means that they are associated with lower odds.

The negative coefficient in Model 0 (an odds ratio below 1 in Table 10) indicates that, in this sample, a gesture is more likely not to be produced than to be produced. Indeed, \(53.77\%\) of the gesture items were not produced by the children in the sample.
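The link between the Model 0 intercept and the share of gestures not produced can be checked with a short calculation. The sketch below uses the rounded intercept of 0.86 from Table 10; the helper name is an assumption made for illustration.

```python
import math


def odds_to_probability(odds):
    """Convert odds p / (1 - p) back to the probability p."""
    return odds / (1.0 + odds)


intercept_odds = 0.86  # Model 0 intercept from Table 10, reported on the odds scale

log_odds = math.log(intercept_odds)  # negative because the odds are below 1
p_produced = odds_to_probability(intercept_odds)

print(round(log_odds, 3))
print(round(p_produced, 4), round(1 - p_produced, 4))
```

With the rounded odds, roughly 46% of items are produced and 54% are not, which agrees with the reported \(53.77\%\) up to rounding of the intercept.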

Model 1 examines whether the intercept varies from child to child, that is, it treats the probability of producing gestures as a random effect. The estimated variance of this effect (\(\tau_{00}\)) is \(1.18\). The random effect is statistically significant (Table 8), so we conclude that the probability of producing gestures varies between children.

Table 8. Confidence Interval for Model 1
	2.5 %	97.5 %
.sig01	1.0344090	1.1405162
(Intercept)	-0.2708215	-0.1340603

In Model 2, the two estimated random effects (\(\tau_{00}\) = \(1.54\) for children and \(\tau_{01}\) = \(0.78\) for type of gesture) are statistically significant (Table 9), which means that the probability of producing gestures varies between children and between types of gestures.

Table 9. Confidence Interval for Model 2
	2.5 %	97.5 %
.sig01	1.1826768	1.3027953
.sig02	0.5284464	1.9276818
(Intercept)	-1.0560035	0.8507078
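The variance components reported for Models 1 and 2 can be turned into the share of total variance attributable to the grouping factors (an intraclass correlation). The sketch below assumes the usual latent-variable formulation of the logistic model, in which the level-1 residual variance is fixed at \(\pi^2/3 \approx 3.29\), the value shown as σ² in Table 10. The function name is hypothetical.

```python
import math

# Level-1 variance of the standard logistic distribution, about 3.29
RESIDUAL_VARIANCE = math.pi ** 2 / 3


def intraclass_correlation(*random_effect_variances):
    """Share of total latent variance due to the random intercepts."""
    between = sum(random_effect_variances)
    return between / (between + RESIDUAL_VARIANCE)


print(round(intraclass_correlation(1.18), 3))        # Model 1: children only
print(round(intraclass_correlation(1.54, 0.78), 3))  # Model 2: children and gesture type
```

These values (0.264 and 0.414) coincide with the second figure reported in the R² row of Table 10 for Models 1 and 2, which is consistent with that figure reflecting the variance explained by the random effects in models without predictors.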

Finally, Model 3 adds fixed effects at level 2 (child) to estimate whether there is a relationship between children's sociodemographic variables and the likelihood that a child will produce gestures. Age has a positive and significant effect, which means that the odds of producing gestures increase as children get older. The effect of mother's education is also significant; however, having a mother with a college degree or above is associated with lower odds of gesture production. With respect to the mother's education level, the empirical evidence in this field has pointed to its influence on early language development. Nevertheless, this effect seems to be mediated by the linguistic input that the child receives and the quality of parental communication (e.g., direct speech, routines) (Serrat-Sellabona et al., 2021). It is therefore possible that the effect of mother's education is mediated by other variables that were not included in the analysis, or that it has a greater impact in later development than in the pre-linguistic stages of language acquisition.

Table 10. Results of the Logistic Hierarchical Models
	Model 0			Model 1			Model 2			Model 3
Predictors	Odds Ratios	std.Error	p-value	Odds Ratios	std.Error	p-value	Odds Ratios	std.Error	p-value	Odds Ratios	std.Error	p-value
Intercept	0.86 ^***	0.01	<0.001	0.82 ^***	0.03	<0.001	0.90	0.36	0.794	0.01 ^***	0.00	<0.001
Age (Months)										1.40 ^***	0.01	<0.001
Mother Education (College degree or above)										0.90 ^*	0.05	0.047
Minority										1.04	0.06	0.477
Random Effects
σ²				3.29			3.29			3.29
τ₀₀				1.18 _{data_id}			1.54 _{data_id}			0.56 _{data_id}
							0.78 _type			0.78 _type
N				1044 _{data_id}			1044 _{data_id}			1044 _{data_id}
							5 _type			5 _type
Observations	65415			65415			65415			65415
R² Tjur	0.000			0.000 / 0.264			0.000 / 0.414			0.163 / 0.406
AIC	90314.091			79769.861			72304.183			71421.092
* p<0.05   ** p<0.01   *** p<0.001
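Because odds ratios act multiplicatively, the per-month effect of age in Model 3 compounds over the age range of the sample. The following sketch illustrates this with the rounded odds ratios from Table 10; the helper name is an assumption, and the result describes odds conditional on the random effects, not probabilities.

```python
def odds_multiplier(odds_ratio, units):
    """Multiplicative change in the odds after `units` one-unit increases."""
    return odds_ratio ** units


AGE_OR = 1.40        # per month of age, Model 3, Table 10
MOTHER_ED_OR = 0.90  # college degree or above vs. below, Model 3, Table 10

# Change in odds between the youngest (8 months) and oldest (18 months) children
print(round(odds_multiplier(AGE_OR, 18 - 8), 1))

# Percentage change in odds associated with higher maternal education
print(round((MOTHER_ED_OR - 1) * 100))
```

Under these rounded estimates, the odds of producing a given gesture are about 29 times higher at 18 months than at 8 months, while maternal college education is associated with odds about 10% lower.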

References

Cribari-Neto, F. and Zeileis, A. (2010). Beta Regression in R. Journal of Statistical Software, 34(2).

George, D. and Mallery, P. (2010) SPSS for Windows Step by Step: A Simple Guide and Reference 17.0 Update. 10th Edition, Pearson, Boston.

Serrat-Sellabona, E., Aguilar-Mediavilla, E., Sanz-Torrente, M., Andreu, L., Amadó, A., and Serra, M. (2021). Sociodemographic and Pre-Linguistic Factors in Early Vocabulary Acquisition. Children, 8, 206.

Smithson, M. and Verkuilen, J. (2006). A Better Lemon Squeezer? Maximum-Likelihood Regression with Beta-Distributed Dependent Variables. Psychological Methods, 11(1), 54-71.

R Project

Carolina Lopera

11/5/2021
