A number of psychometric properties were used to investigate the validity and reliability of the numeracy and literacy (sub)tests. For each scale, the mean scale score and the standard error of the mean scale score were calculated. The mean scale score was divided by the number of items in the scale, which results in mean scale scores between 0 and 1. These means and standard errors were compared for the two versions (A and B) of the subtests in order to explore the equivalence of the two versions. We decided that the means differ significantly when the difference in means is larger than the pooled standard error of the two versions. This would indicate an effect size of 1. When the difference in standard error between the two versions was larger than .05, we decided that the two versions differ significantly in the variation of the scores. Note that these criteria are liberal and are only meant for exploring the equality of the two versions and detecting inequality. If it is decided that both versions will be used, a more stringent test is required on the final versions of the test.
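As an illustration, the scale-level comparison described above could be sketched as follows in Python. This is a minimal sketch and not the code used in the study. The function names are hypothetical, and the text does not define how the pooled standard error is computed, so the sketch assumes the square root of the sum of the two squared standard errors.

```python
import numpy as np

def scale_summary(scores):
    """Mean and standard error of the scale score, rescaled to 0..1.

    `scores` is a children x items array of binary (0/1) item scores.
    """
    scores = np.asarray(scores, dtype=float)
    # Scale score per child divided by the number of items in the scale.
    prop = scores.sum(axis=1) / scores.shape[1]
    mean = prop.mean()
    se = prop.std(ddof=1) / np.sqrt(len(prop))  # standard error of the mean
    return mean, se

def compare_versions(scores_a, scores_b, se_gap=0.05):
    """Apply the two liberal screening criteria to versions A and B."""
    mean_a, se_a = scale_summary(scores_a)
    mean_b, se_b = scale_summary(scores_b)
    # ASSUMPTION: pooled standard error = sqrt(se_a^2 + se_b^2).
    pooled_se = np.sqrt(se_a ** 2 + se_b ** 2)
    return {
        "means_differ": bool(abs(mean_a - mean_b) > pooled_se),
        "variation_differs": bool(abs(se_a - se_b) > se_gap),
    }

# Toy data, invented for illustration only.
version_a = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1]]
version_b = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 0]]
result = compare_versions(version_a, version_b)
```

Children with missing item scores would have to be excluded or handled before calling these functions; the sketch assumes complete data.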
Furthermore, Loevinger's H coefficient was calculated (Sijtsma and Molenaar, 2002). This nonparametric scaling coefficient shows whether the items of the scale form an ordinal scale. H values below .3 indicate that there is no scale, H values between .3 and .4 indicate a weak ordinal scale, H values between .4 and .6 indicate a medium scale and H values larger than .6 indicate a strong scale. We used the H coefficient instead of factor analysis for two reasons. First, the items are binary, which violates the normality assumption of factor analysis. Since nonparametric IRT does not rely on any distributional assumptions, it may be a better option here. Second, the sample size is relatively small, which may be problematic given the asymptotic assumptions of factor analysis. Nonparametric IRT may therefore be preferred.
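For binary items, the H coefficient can be written as the sum of the observed inter-item covariances divided by the sum of the maximum covariances that are possible given the item proportions correct. The sketch below follows that definition. It is an illustration under stated assumptions, not the software used in the study: the function names are hypothetical, the data are assumed to be complete, and the handling of values that fall exactly on a cut-off (.4 or .6) is a choice made here because the text leaves it open.

```python
import numpy as np

def loevinger_h(scores):
    """Scale-level H for a children x items array of 0/1 scores."""
    x = np.asarray(scores, dtype=float)
    p = x.mean(axis=0)                         # proportion correct per item
    cov = np.cov(x, rowvar=False, bias=True)   # observed covariances
    # Maximum covariance of two binary items given their marginals:
    # min(p_i, p_j) - p_i * p_j.
    cov_max = np.minimum.outer(p, p) - np.outer(p, p)
    upper = np.triu_indices(x.shape[1], k=1)   # each item pair once
    return cov[upper].sum() / cov_max[upper].sum()

def label_h(h):
    """Verbal label following the cut-offs given in the text."""
    if h < 0.3:
        return "no scale"
    if h < 0.4:
        return "weak"
    if h <= 0.6:   # ASSUMPTION: boundary values go to the lower category
        return "medium"
    return "strong"

# A perfect Guttman pattern (invented data) yields H = 1.
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
h = loevinger_h(guttman)
```

An item answered correctly (or incorrectly) by every child has zero variance and would need to be dropped first, since it contributes nothing to either sum.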
We compared the H coefficient for the two versions in order to evaluate whether the two versions have comparable ordinal scalability. When the difference in H coefficient is larger than .1, we consider the two versions to differ. Note again that this criterion is liberal and only meant for exploring the equality of the items from the two versions.
Finally, the reliability of the scale was calculated using lambda2. The reliability coefficient has to be at least .8 to conclude that the scale is reliable. We compared the reliability of the two versions. When lambda2 of the two versions differed by more than .1, we decided that the two versions differ in reliability.
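Lambda2 is commonly attributed to Guttman and can be computed from the item covariance matrix: it adds a term based on the squared off-diagonal covariances to one minus the ratio of summed item variances to total score variance. The following sketch shows one way to compute it and to apply the two criteria from the text. The function names are hypothetical and complete data are assumed.

```python
import numpy as np

def guttman_lambda2(scores):
    """Guttman's lambda2 for a children x items array of item scores."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    cov = np.cov(x, rowvar=False)          # item covariance matrix
    total_var = cov.sum()                  # variance of the total score
    off_diag = cov - np.diag(np.diag(cov))
    lambda1 = 1.0 - np.trace(cov) / total_var
    correction = np.sqrt(k / (k - 1) * (off_diag ** 2).sum()) / total_var
    return lambda1 + correction

def reliability_check(scores_a, scores_b, minimum=0.8, max_gap=0.1):
    """Apply the .8 and .1 criteria from the text to two versions."""
    l2_a = guttman_lambda2(scores_a)
    l2_b = guttman_lambda2(scores_b)
    return {
        "a_reliable": bool(l2_a >= minimum),
        "b_reliable": bool(l2_b >= minimum),
        "versions_differ": bool(abs(l2_a - l2_b) > max_gap),
    }

# Invented data: three identical items give lambda2 = 1.
identical = [[0, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]]
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
l2_identical = guttman_lambda2(identical)
l2_guttman = guttman_lambda2(guttman)
```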
At item level we calculated the mean (proportion correct) and the item scalability, Hi. Note that the standard error is not relevant because it is a function of the proportion correct. Hi is a measure of the scalability of the item and is comparable to the factor loading of an item in factor analysis. When an item has a relatively low Hi, the item does not fit well in the scale. This may be the case when an item is malfunctioning and/or when it is measuring something other than what the other items in the scale measure. Furthermore, Hi helps us evaluate whether there are items that are too difficult or too easy. Finally, since one of the purposes of the validation study was to shorten the tests, we also evaluated which items may be removed.
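The item-level statistics can be sketched in the same way as the scale-level H: for item i, Hi is the sum of its observed covariances with all other items divided by the sum of the corresponding maximum covariances. The code below is an illustrative sketch with a hypothetical function name, assuming complete binary data, and is not the software used in the study.

```python
import numpy as np

def item_statistics(scores):
    """Return (proportion correct, Hi) per item for 0/1 item scores."""
    x = np.asarray(scores, dtype=float)
    p = x.mean(axis=0)
    cov = np.cov(x, rowvar=False, bias=True)
    cov_max = np.minimum.outer(p, p) - np.outer(p, p)
    # Exclude each item's covariance with itself (the diagonal).
    np.fill_diagonal(cov, 0.0)
    np.fill_diagonal(cov_max, 0.0)
    hi = cov.sum(axis=1) / cov_max.sum(axis=1)
    return p, hi

# Invented data: the third item runs against the other two,
# so it receives the lowest Hi.
data = [[0, 0, 1], [1, 0, 1], [1, 1, 0], [1, 1, 0]]
p, hi = item_statistics(data)
```

In this toy example the third item has a negative Hi, which is the kind of pattern that would flag an item as not fitting the scale.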
We also compared the mean and the Hi for the two versions of the assessment. Since the items within a part of the subtest can be assumed to be replications and therefore exchangeable, we calculated the mean of the proportion correct and of the Hi values per part and compared these values for the two versions. For the average proportion correct, we decided that the two versions differed for the part when the difference in means divided by the pooled standard error was larger than 1. This indicates an effect size of 1. For the scalability coefficient, we decided that the scalability of the two versions of the part differed when the average Hi of the items within a part differed by more than .1 between the two versions. When there were differences at the part level, we explored the items within the part to detect items that may explain the differences in proportion correct and/or scalability.
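The part-level screening for scalability amounts to averaging Hi over the items of each part and flagging parts where the two versions differ by more than .1. A minimal sketch is given below. The function names and the part labels are hypothetical, and the sketch assumes that both versions list their items in the same order with the same part labels.

```python
from collections import defaultdict

def part_averages(values, parts):
    """Average a per-item statistic (e.g. Hi) within each part."""
    grouped = defaultdict(list)
    for value, part in zip(values, parts):
        grouped[part].append(value)
    return {part: sum(v) / len(v) for part, v in grouped.items()}

def parts_with_different_hi(hi_a, hi_b, parts, threshold=0.1):
    """Parts whose average Hi differs by more than `threshold`."""
    avg_a = part_averages(hi_a, parts)
    avg_b = part_averages(hi_b, parts)
    return [part for part in avg_a if abs(avg_a[part] - avg_b[part]) > threshold]

# Invented Hi values for four items in two parts.
parts = ["a", "a", "b", "b"]
hi_version_a = [0.5, 0.6, 0.3, 0.4]
hi_version_b = [0.5, 0.5, 0.6, 0.5]
flagged = parts_with_different_hi(hi_version_a, hi_version_b, parts)
```

Only part b is flagged here, which would then prompt a closer look at the individual items in that part.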
In order to evaluate whether the tests function equally for girls and boys, we compared the two groups on scale scores and item scores. The standard deviation and the H/Hi coefficients were compared using the same criteria as for the comparison of the two versions. The mean scale scores and mean item scores were also compared, but it has to be emphasized that differences in these mean scores do not necessarily mean that there is differential item functioning. It may well be the case that there are group differences between girls and boys but that the items measure the same construct when conditioned on the total number correct. Still, it might be interesting to see whether there are any scales and items that show different proportions correct for girls and boys.
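The idea of conditioning on the total number correct can be illustrated with a small sketch: the proportion correct on an item is computed separately for each combination of total score and group, so that girls and boys with the same total score are compared. This only illustrates the reasoning above. It is not an analysis reported in this study, and the function name and group labels are hypothetical.

```python
from collections import defaultdict

def proportion_by_total(scores, groups, item):
    """Proportion correct on `item` per (total score, group) cell."""
    cells = defaultdict(list)
    for row, group in zip(scores, groups):
        cells[(sum(row), group)].append(row[item])
    return {cell: sum(v) / len(v) for cell, v in cells.items()}

# Invented data: four children, three binary items.
scores = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 1, 1]]
groups = ["girl", "boy", "girl", "boy"]
table = proportion_by_total(scores, groups, item=1)
```

If, within the same total score, the proportions correct are similar for both groups, a difference in the overall item mean reflects a group difference in ability rather than differential item functioning. With real data, each cell needs enough children for the proportions to be meaningful.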
We started by recoding the answers into binary scores. A correct answer was recoded into 1; an incorrect answer and the answer "not able to answer" were both recoded into 0.
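The recoding rule can be sketched as a simple lookup. The response labels used below are hypothetical, since the raw coding of the data file is not given here. Keeping unrecorded answers as missing (NaN) rather than scoring them 0 is also an assumption of this sketch.

```python
import math

# ASSUMPTION: hypothetical labels for the raw response categories.
RECODE = {
    "correct": 1,
    "incorrect": 0,
    "not able to answer": 0,
}

def recode(responses):
    """Map raw answers to binary scores; anything else stays missing."""
    return [RECODE.get(response, math.nan) for response in responses]

raw = ["correct", "incorrect", "not able to answer", None]
binary = recode(raw)
```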
The literacy test consists of nine subtests, named Concepts about print, Phonological awareness, Vocabulary, Listening comprehension, Letter naming, Sound naming, Reading, Reading comprehension and Writing. Below, the number of items and the percentage of missing values are shown.
The subtests Phonological awareness, Vocabulary, Reading and Writing themselves consist of a number of sub-subtests. Children only progress to the next sub-subtest when they have a certain score on the previous sub-subtest. As a consequence, there are a lot of missing values for children who did not do the more difficult sub-subtests.
In order to do the psychometric analyses on as much data as possible, we decided to analyse the data at sub-subtest level. Below we show the number of items and the number of missing values per sub-subtest. The sub-subtests Phonological awareness - syllable deletion, One minute word reading, Reading sentences and Reading comprehension have a very large number of missing values. For these sub-subtests it is not meaningful to do any psychometric analyses.
The scale means show that version A is a bit less difficult than version B. The standard errors of the two versions are similar. The H values are high, indicating a strong ordinal scale, and Lambda2 shows that the subtest Concepts about print is sufficiently reliable and that the two versions have equal reliability.
At scale level, version A has stronger scalability for males than for females. The standard error and Lambda2 are similar.
At item level it turned out that all three items of version A have stronger scalability for males than for females; however, this difference is not significant.
The items of the Concepts about print subtest form a strong ordinal scale with sufficient reliability. Version A is a bit easier than Version B. Because the two versions had exactly the same items, the difference in difficulty is due to sample fluctuation. We can use this result to interpret the (in)equality of the two versions for the other subtests. Version A has stronger scalability for males than for females.
The Phonological awareness subtest consists of three subsubtests. Since the subsubtest syllable deletion has a very large number of missing values, this subsubtest was not analysed. The scale psychometrics for syllable blending and syllable segmentation show that both subsubtests are strong scales and that reliability is good. The two versions do not differ significantly on the scale statistics. The subsubtest syllable segmentation is very difficult.
The item psychometrics show that, for both subsubtests, the proportions correct of the items do not differ significantly between the two versions. Moreover, the Hi values are similar for the two versions for both subsubtests.
For both subsubtests and both versions, the se, H and Lambda2 do not differ between males and females.
At item level there are no significant differences with respect to scalability Hi for males and females.
The items of syllable blending and syllable segmentation form reliable, strong scales that do not differ between the two versions or by gender.
The subtest Vocabulary consists of two subsubtests: Receptive vocabulary and Productive vocabulary. For Receptive vocabulary version B is less difficult than version A. For Productive vocabulary version B is more difficult than version A. The scalability H is intermediate for both subsubtests. The reliability is good.
The item psychometrics show that the item difficulty varies greatly between the two versions for both subsubtests. The Hi values show that item e2a3 (version A) and item e3b9 (version B) have weak scalability (Hi<.3).
For receptive vocabulary the scalability is better for males than females for version A. The se and the reliability do not differ for males and females. With respect to productive vocabulary the scalability is better for males than females for version B. The se and the reliability do not differ for males and females.
At item level, items e3a1, e3a5, e3a11 and e3a16 have (much) higher Hi values for males than for females in version A. These items should be revised to make them more similar for males and females. For productive vocabulary, items e3b1 and e3b6 in version A have higher Hi values for males than for females. In version B, items f3b1, f3b6, f3b9 and f3b10 have higher Hi values for males than for females. These items should be revised to make them more similar for males and females.
The items of receptive vocabulary form an intermediate strong scale with good reliability. However, version B is less difficult than version A. (Note that this difference may be even larger because we know from the first subtest that children in Version B perform less well.) Item e2a3, version A, has weak scalability. There is differential item functioning for gender for items e3b1 and e3b6 in version A. The items of productive vocabulary also form an intermediate strong scale with good reliability. However, version B is more difficult than version A. (Note, however, that this difference may not be relevant because we know from the first subtest that children in Version B perform less well.) Item e2a3, version A, has weak scalability. There is differential item functioning for gender for items e3b1 and e3b6 in version A. Item e3b9, version B, has weak scalability. There is differential item functioning for gender for items e3b1 and e3b6 and for items f3b1, f3b6, f3b9 and f3b10.
The subtest Listening comprehension differs in difficulty level for the two versions. Version B is more difficult. The scalability is strong for both versions but significantly stronger for version A. The reliability is good.
The aggregated means show that, for both parts, version A is less difficult than version B. The Hi values do not differ.
The item psychometrics show that all items of part a are more difficult for version A than for version B. For part b, the difference in difficulty is most pronounced in item e4b1 which is more difficult in version B.
At scale level, it turned out that there is no differential item functioning for gender. The standard error, the H value and Lambda2 are similar. The scale means are different for females and males: the subtest is more difficult for females than for males.
At item level there is no differential item functioning for gender.
The items of the Listening comprehension subtest form a strong ordinal scale with good reliability. Version A is more difficult than version B, for both parts and for all items within the parts. (Note that this difference may be even larger because we know from the first subtest that children in Version B perform less well.) There is no differential item functioning for gender.
The subtest Letter naming has a lot of missing values, so the sample size for the psychometric analysis is small. Version A is less difficult than version B. The scalability is intermediate. The reliability is good.
The aggregated means show that the items in parts a and d are more difficult in version B. The scalability of the four parts does not differ for the two versions.
In part a, items e5a1 or e5a3 may be interchanged in version A and B. In part d, e5d3 or e5d5 may be interchanged in version A and B to make the two versions more similar.
At scale level, it turned out that there is no differential item functioning for gender. The standard error, the H value and Lambda2 are similar.
At item level, in version A, items e5a4, e5b2 and e5b3 have stronger scalability for males than for females. In version B, items f5a3, f5b5 and f5d2 have stronger scalability for females than for males. These items might be replaced.
The subtest Letter naming has good reliability and intermediate scalability. Version A is less difficult than Version B, and this seems to be caused by some part a and part d items. (Note, however, that this difference may not be relevant because we know from the first subtest that children in Version B perform less well.) There are some items that differ in scalability for males and females, both in version A and in version B.
The subtest Sound naming really has a lot of missings. The sample size of version A is only 27 and of version B it is only 14. This is too small to do any psychometric analysis.
The subtest Reading words really has too many missings to do any psychometric analysis.
The subtest Writing letters forms a strong scale with good reliability. There are no significant differences between version A and B.
The item psychometrics show that the items have strong scalability in both versions. On average, item difficulty is similar for both versions.
At scale level, it turned out that there is no differential item functioning for gender. The standard error, the H value and Lambda2 are similar. The scale means are different for girls and boys: the subtest is more difficult for girls than for boys.
At item level there is no differential item functioning for gender.
The items of the writing letters subtest form a strong ordinal scale with good reliability. The two versions do not differ and there is no differential item functioning.