Several psychometric properties were used to investigate the validity and reliability of the math and literacy (sub)tests. For each scale we calculated the mean scale score and its standard error. The mean scale score was divided by the number of items in the scale, so that mean scale scores lie between 0 and 1. These means and standard errors were compared for the two versions (A and B) of the subtests in order to explore whether the two versions are equivalent. We considered the means to differ significantly when the difference in means was larger than the pooled standard error of the two versions. Note that this corresponds to an effect size of 1. When the standard errors of the two versions differed by more than .05, we considered the two versions to differ significantly in the variation of the scores.
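The comparison rule described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and because the text does not define the pooled standard error, the sketch assumes it is the root mean square of the two standard errors.

```python
import numpy as np

def scale_summary(sum_scores, n_items):
    """Mean scale score rescaled to the 0-1 range, with its standard error."""
    scaled = np.asarray(sum_scores, dtype=float) / n_items
    mean = scaled.mean()
    se = scaled.std(ddof=1) / np.sqrt(len(scaled))
    return mean, se

def versions_differ(mean_a, se_a, mean_b, se_b, se_tolerance=0.05):
    """Apply the two decision rules: means and variation of the scores."""
    # Assumption: pooled SE = root mean square of the two standard errors.
    pooled_se = np.sqrt((se_a ** 2 + se_b ** 2) / 2)
    means_differ = abs(mean_a - mean_b) > pooled_se
    variation_differs = abs(se_a - se_b) > se_tolerance
    return bool(means_differ), bool(variation_differs)

# Made-up sum scores on a 10-item scale for versions A and B.
mean_a, se_a = scale_summary([8, 9, 7, 10, 9, 8], n_items=10)
mean_b, se_b = scale_summary([4, 5, 3, 6, 5, 4], n_items=10)
result = versions_differ(mean_a, se_a, mean_b, se_b)
```

With these made-up scores the means differ by far more than the pooled standard error, while the two standard errors are identical, so only the first rule flags a difference.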
Furthermore, Loevinger's H coefficient was calculated. This nonparametric scaling coefficient shows whether the items of the scale form an ordinal scale. An H below .3 indicates that there is no scale, an H between .3 and .4 indicates a weak ordinal scale, an H between .4 and .6 indicates a medium scale, and an H larger than .6 indicates a strong scale. We used this H coefficient instead of factor analysis for two reasons. First, the items are binary, which violates the normality assumption of factor analysis; since nonparametric IRT does not rely on any distributional assumptions, it may be a better option here. Second, the sample size is relatively small, which may be problematic given the asymptotic assumptions of factor analysis. Nonparametric IRT may therefore be preferred.
We compared the H coefficient for the two versions in order to evaluate whether the two versions have comparable ordinal scalability. When the difference in H coefficient is larger than .1, we consider the two versions to differ.
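For binary items, the scale H can be written as the sum of the observed inter-item covariances divided by the sum of the maximum covariances that are possible given the item proportions correct. The sketch below illustrates that definition for complete data; the function names are hypothetical and this is not necessarily the software used for the reported values.

```python
import numpy as np

def loevinger_h(X):
    """Scale H for a persons-by-items matrix of 0/1 scores (no missing values)."""
    X = np.asarray(X, dtype=float)
    p = X.mean(axis=0)                        # proportion correct per item
    cov = np.cov(X, rowvar=False, bias=True)  # observed covariances p_ij - p_i * p_j
    n_items = X.shape[1]
    observed = 0.0
    maximum = 0.0
    for i in range(n_items):
        for j in range(i + 1, n_items):
            observed += cov[i, j]
            # Largest covariance attainable with these two proportions correct.
            maximum += min(p[i], p[j]) - p[i] * p[j]
    return observed / maximum

def classify_h(h):
    """Label H with the cut-offs from the text (.4 itself is treated as medium here)."""
    if h < 0.3:
        return "no scale"
    if h < 0.4:
        return "weak"
    if h <= 0.6:
        return "medium"
    return "strong"

# A perfect Guttman pattern: nobody passes a harder item after failing an easier one.
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
h_value = loevinger_h(guttman)
```

Items that everyone answers correctly or incorrectly contribute zero to both sums, so a scale made up only of such items would give a division by zero in this sketch.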
Finally, the reliability of the scale was estimated with lambda2. The reliability coefficient has to be at least .8 to conclude that the scale is reliable. We compared the reliability of the two versions: when lambda2 of the two versions differed by more than .1, we considered the two versions to differ in reliability.
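As an illustration, Guttman's lambda2 can be computed from the item covariance matrix as lambda1 plus a term based on the squared off-diagonal covariances. The sketch below assumes complete data, and the function name is hypothetical.

```python
import numpy as np

def guttman_lambda2(X):
    """Guttman's lambda2 for a persons-by-items score matrix without missing values."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    cov = np.cov(X, rowvar=False)           # k x k item covariance matrix
    total_var = cov.sum()                   # variance of the sum score
    off_diag = cov - np.diag(np.diag(cov))  # covariances only, variances set to 0
    lambda1 = 1.0 - np.trace(cov) / total_var
    return lambda1 + np.sqrt(k / (k - 1) * (off_diag ** 2).sum()) / total_var

# Small made-up example: three binary items, four children.
scores = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
reliability = guttman_lambda2(scores)
is_reliable = reliability >= 0.8            # criterion used in the text
```

Lambda2 is never smaller than Cronbach's alpha for the same data, which is one reason it is often reported alongside Mokken scale analyses.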
At item level we calculated the mean (proportion correct) and the item scalability, Hi. Note that the standard error is not relevant here because it is a function of the proportion correct. Hi is a measure of the scalability of the item and is comparable to the factor loading of an item in factor analysis. When an item has a relatively low Hi, the item does not fit well in the scale. This may be the case when an item is malfunctioning and/or when it measures something other than the other items in the scale. Furthermore, we evaluated whether there are items that are too difficult or too easy. Finally, since one of the purposes of the validation study was to shorten the tests, we also evaluated which items may be removed.
We also compared the mean and the Hi for the two versions. Since the items within a part of the subtest can be assumed to be replications and therefore exchangeable, we calculated the mean of the proportion correct and of the Hi values per part and compared these values for the two versions. For the average proportion correct, we decided that the two versions differed for a part when the difference in means divided by the pooled standard error was larger than 1. This indicates an effect size of 1. For the scalability coefficient, we decided that the two versions of a part differed in scalability when the average Hi of the items within the part differed by more than .1. When there were differences at part level, we explored the items within the part to detect items that may explain the differences in proportion correct and/or scalability.
In order to evaluate whether the tests function equally for girls and boys, we compared the two groups on scale scores and item scores. The standard deviation and the H/Hi coefficients were compared using the same criteria as for the comparison of the two versions. The mean scale scores and mean item scores were also compared, but it has to be emphasized that differences in these mean scores do not necessarily mean that there is differential item functioning. It may well be the case that there are group differences between girls and boys but that the items measure the same construct when conditioned on the total number correct. Still, it is of interest to see whether there are any scales and items that show different proportions correct for girls and boys, so we evaluated this as well.
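The item coefficient Hi uses the same ingredients as the scale H, but restricted to the pairs that involve item i. A minimal sketch for complete binary data (the function name is hypothetical):

```python
import numpy as np

def item_h(X):
    """Item scalability Hi for each column of a persons-by-items 0/1 matrix."""
    X = np.asarray(X, dtype=float)
    p = X.mean(axis=0)                                 # proportion correct per item
    cov = np.cov(X, rowvar=False, bias=True)           # observed covariances
    max_cov = np.minimum.outer(p, p) - np.outer(p, p)  # maximum given the proportions
    np.fill_diagonal(cov, 0.0)                         # drop each item's pair with itself
    np.fill_diagonal(max_cov, 0.0)
    return cov.sum(axis=1) / max_cov.sum(axis=1)

# Three items in a perfect Guttman pattern plus a fourth item that does not fit.
X = [[0, 0, 0, 1],
     [1, 0, 0, 0],
     [1, 1, 0, 1],
     [1, 1, 1, 0]]
proportions_correct = np.mean(X, axis=0)
hi_values = item_h(X)
```

In this made-up example the fourth item covaries negatively with the others, so its Hi is negative, which is the kind of pattern that would mark an item as not fitting the scale.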
We started by recoding the answers into binary scores. A correct answer was recoded into 1; an incorrect answer and the answer "not able to answer" were both recoded into 0.
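A minimal sketch of this recoding step is shown below. The raw answer labels are hypothetical, since the actual codes in the data file are not given here, and items that were not administered are assumed to stay missing.

```python
# Hypothetical raw answer codes; the real data may use different labels.
CORRECT = "correct"
INCORRECT = "incorrect"
NOT_ABLE = "not able to answer"

def recode(answer):
    """Recode a raw answer into a binary score (1 = correct, 0 = otherwise)."""
    if answer is None:
        return None                      # item not administered: stays missing
    if answer == CORRECT:
        return 1
    if answer in (INCORRECT, NOT_ABLE):
        return 0
    raise ValueError(f"unexpected answer code: {answer!r}")

raw = [CORRECT, INCORRECT, NOT_ABLE, None]
binary = [recode(a) for a in raw]
```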
The math test consists of eight subtests: Number recognition, Counting, Quantity discrimination, Number and place value, Addition and subtraction, Multiplication and division, Time, and Shape. Below, the number of items and the percentage of missing values are shown.
The subtests Number and place value, Addition and subtraction, Multiplication and division, Time, and Shape each consist of a number of sub-subtests. Children only progress to the next sub-subtest when they reach a certain score on the previous sub-subtest. As a consequence, there are many missing values for children who did not take the more difficult sub-subtests.
In order to do the psychometric analyses on as much data as possible, we decided to analyse the data at sub-subtest level. Below we show the number of items and the number of missing values per sub-subtest. The sub-subtests a/b13 and a/b14 from the subtest Time have a very large number of missing values. For these sub-subtests it does not make sense to do any psychometric analyses.
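The missing-value bookkeeping per sub-subtest could be done along the following lines. This is a sketch under the assumption that items a child never reached are stored as NaN; the function names are hypothetical.

```python
import numpy as np

def percent_missing(scores):
    """Percentage of missing (NaN) entries in a persons-by-items array."""
    scores = np.asarray(scores, dtype=float)
    return 100.0 * np.isnan(scores).mean()

def complete_cases(scores):
    """Keep only the children who have a score on every item of the sub-subtest."""
    scores = np.asarray(scores, dtype=float)
    return scores[~np.isnan(scores).any(axis=1)]

# Made-up sub-subtest with two items; two of the four children never reached it.
sub_subtest = [[1, 0], [np.nan, np.nan], [1, 1], [np.nan, np.nan]]
pct = percent_missing(sub_subtest)
usable = complete_cases(sub_subtest)
```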
The table below shows the scale properties for the two versions. The scale means show that version A is a bit less difficult than version B. The standard errors of the two versions are similar. The H values are high, indicating a strong ordinal scale, and Lambda2 shows that the Recognition subtest is reliable and that the two versions have equal reliability.
The aggregated means show that part c is more difficult in version B. The item properties show that item c1 (100 vs 101) and c2 (115 vs 135) are more difficult in version B. The aggregated Hi show that part c differs in scalability. The item properties show that c3, c4 and c5 have higher Hi values in version A.
At scale level, it turned out that there is no differential item functioning for gender. The standard error, the H value and Lambda2 are similar. The scale means are different for girls and boys: the subtest is more difficult for girls than for boys.
At item level it turned out that only the items of part d differ for girls and boys. Since this part has very low proportions correct and therefore hardly any variation in scores, it is not meaningful to interpret these differences in scalability.
The items of the Recognition subtest form a strong ordinal scale with sufficient reliability. The items of part d are very hard and show hardly any variation. The two versions of the subtest do not differ with respect to scalability and reliability, but version A is less difficult than version B. This is probably caused by some part c items. There is no differential item functioning for gender, except for the malfunctioning part d items.
The scale means show that version A is a bit easier than version B, but this difference is not significant. The standard errors of the two versions are similar. The H values are high, indicating a strong ordinal scale, and Lambda2 shows that the subtest Counting is reliable and that the two versions have equal reliability.
At scale level, it turned out that there is differential item functioning for gender for version A. The H value and Lambda2 are higher for males, indicating that the scale discriminates better and has higher reliability for males. Both versions, however, have strong scalability and high reliability for both males and females.
At item level it turned out that scalability is a bit better for males than for females in version A. Since the item H’s are high for all items, for both versions and males and females, there might be no need to adjust the items.
The four items of the Counting subtest form a very strong ordinal scale with high reliability. There are no substantial differences between version A and version B. The scalability in version A is a bit stronger for males, but still strong for both groups.
The scale means show that version A is a bit less difficult than version B. The standard errors of the two versions are similar. The H values indicate a strong ordinal scale, and Lambda2 shows that the subtest Quantity discrimination is reliable and that the two versions have equal reliability.
At scale level, it turned out that there is no differential item functioning for gender. The standard error, the H value and Lambda2 are similar.
At item level there is no differential item functioning for gender.
The items of the Quantity discrimination subtest form a strong ordinal scale with high reliability. Items b2a3 and a2a2 may be exchanged in order to equalize the two versions A and B. There is no differential item functioning.
The Number and place value scale consists of three subsubtests, a/b3, a/b4 and a/b5. Because subsubtest a/b5 has a very large number of missing values, we cannot evaluate the psychometric properties of this part. Subsubtest a/b4 also had quite a few missing values, but enough observations remained for the psychometric analyses. In order to have the largest possible sample size for the analysis, we analysed the a/b3 and a/b4 parts separately.
The first table below shows the scale properties for the two versions of the a/b3 subsubtest. The scale means show that version A is less difficult than version B. The standard errors of the two versions do not differ. The H value for version A indicates a strong scale, but the H value for version B indicates medium scalability. The reliability for the two versions is high and similar.
The second table shows the scale properties for the two versions of the a/b4 subsubtest. Note that the sample size dropped by one third. The scale means show that this subsubtest is very difficult in both versions, and most difficult in version A. The H values for version A and B are similar and indicate medium scalability. The reliability for the two versions is high and similar.
The aggregated means of subsubtest a/b3 show that part a is more difficult in version B. This difference is significant. The item properties show that item a3a1 (biggest 18 vs 15) is (much) easier than all other items in version A and version B. The aggregated Hi values show that none of the parts has a significant difference in H values between the two versions (i.e. \(Diff > .15\)). Overall, the Hi values of the items of parts b, c and d are lower in version B than in version A. Exchanging some items of these parts may help to equalize the scalability.
The aggregated means of part a/b4 show that part c is more difficult in version A. This difference is significant. The item properties show that all part c items are more difficult in version A. It may therefore be a good idea to exchange two of the four items of version A with version B. The item properties further show that the proportion correct for the a/b4 items are all very low indicating that this subsubtest is very difficult.
At scale level, it turned out that there is no differential item functioning for gender for the a/b3 subsubtest. For the a/b4 subsubtest there is differential item functioning for version A: the scalability is better for males than for females. At item level we see no differential item functioning for the a/b3 items. For subsubtest a/b4, items a2, a4, b1, b3 and b4 show differences in scalability between males and females. Since these items are very difficult, the variation in scores is also very restricted. Therefore, the scalability of the items may not be reliable.
The items of the Number and place value subsubtest a/b3 form a strong ordinal scale with good reliability. There is some difference in scalability between the two versions, and I would advise exchanging items of parts b, c and d. There is no differential item functioning for this subsubtest.
Subsubtest a/b4 turned out to be very difficult. Because of this difficulty, the scalability of the versions differs and there are also differences in scalability for males and females. Subsubtest a/b5 had too few observations to do any psychometric analyses.
The subtest Addition and subtraction consists of three subsubtests, a/b6, a/b7 and a/b8. Subsubtests a/b7 and a/b8 have many missing values, so we analysed the three subsubtests separately.
The first table below shows the scale properties for the two versions of the a/b6 subsubtest. The scale means show that both versions are equally difficult, and the standard errors of the two versions do not differ. The H value for version A indicates a strong scale, but the H value for version B indicates medium scalability. The difference in scalability is not significant. The reliability for the two versions is high and similar.
The second table shows the scale properties for the two versions of the a/b7 subsubtest. Note that the sample size dropped by 60%. The scale means show that this subsubtest is more difficult in version B. The H values for version A and B are similar and indicate strong scalability. The reliability for the two versions is high and similar.
The third table shows the scale properties for the two versions of the a/b8 subsubtest. Note that the sample size is very small for psychometric analysis. As a consequence, we have to interpret the results with caution. The scale means show that this subsubtest is very difficult in both versions, and most difficult in version B. The H values for version A and B differ. Both H values indicate strong scalability, but the scalability is stronger for version B. The reliability for the two versions is high and similar.
The aggregated means of subsubtest a/b6 show that the means and H values of the parts do not differ between version A and version B. For subsubtest a/b7 there are differences for all parts (a, b, c and d) with respect to the means: version A is easier than version B. The item means of these parts show that the proportion correct for almost all items is lower in version B than in version A. So it might be a good idea to exchange half of the items. The scalability of the parts does not differ between the two versions.
For subsubtest a/b8 the means and scalability of all three parts differ. Part c items are too difficult and this part should be removed. Parts a and b are more difficult for version B than for version A. The scalability is better for version B for these parts but because the sample size is small and the proportion correct very low for version B items, these scalability coefficients are not reliable. The item means show that all items of version A are easier than the items of version B. So it might again be a good idea to exchange half of the items.
We only evaluate the differential item functioning for subsubtests a/b6 and a/b7. At scale level, it turned out that there is no differential item functioning for gender for the a/b6 and a/b7 subsubtests.
At item level there is no differential item functioning for a/b6 items for gender. For the a/b7 subsubtest items bc7 and bd3 differ in scalability for version B. These items have stronger scalability for females than for males.
The items of the Addition and subtraction subsubtest a/b6 form a strong ordinal scale with good reliability. There is no difference in scalability for the two versions. There is no differential item functioning for this subsubtest.
Subsubtest a/b7 turned out to be more difficult for version B items. The items of the two versions of this subsubtest can probably be exchanged to equalize the two versions. There is some differential item functioning for items bc7 and bd3. The part c items of subsubtest a/b8 were too difficult and should be removed. The part a and part b items should be exchanged between the two versions in order to equalize them.
The subtest Multiplication and Division consists of three subsubtests, a/b9, a/b10 and a/b11. Subsubtest a/b11 has too many missing values, so we analysed only subsubtests a/b9 and a/b10.
The first table below shows the scale properties for the two versions of the a/b9 subtest. The scale means show that version A is less difficult than version B. Subsubtest a/b9 forms a medium scale and reliability is high.
The second table shows the scale properties for the two versions of the a/b10 subsubtest. Note that the sample size dropped by 68%. The scale means show that this subsubtest is very difficult, and more difficult in version B. The H values for version A and B differ: version A has the higher H value, but both H values are high. The reliability for the two versions is high and similar.
The aggregated means of subsubtest a/b9 show that the means of parts a, b and c differ for the two versions: version A is easier than version B. The H values of the three parts do not differ between version A and version B. The item means show that a9a1, a9a6, a9b2 and a9b3 are much easier in version A than in version B. For subsubtest a/b10 there are differences for all three parts with respect to the means. Part c is too difficult and should be removed from the test. Items a10a1, a10b1, a10b3 and a10b4 are much easier in version A. The scalability of part a differs for the two versions: it is much stronger for version A than for version B.
Because the number of observations is too small to evaluate the differential item functioning for the subsubtest a/b10, we only evaluate it for subsubtest a/b9. At scale level, it turned out that there is no differential item functioning for gender. The standard error, the H value and Lambda2 are similar.
At item level there is no differential item functioning for the items of subsubtest a/b9.
Subsubtest a/b9 of the Multiplication and Division subtest is more difficult in version B than in version A. Items a9a1, a9a6, a9b2 and a9b3 are much easier in version A than in version B. The subsubtest forms a medium scale and there is no differential item functioning. Subsubtest a/b10 is very difficult, especially in version B. Part c is too difficult and should be removed from the test. Items a10a1, a10b1, a10b3 and a10b4 are easier in version A. The scalability of version A is much stronger than that of version B.
Subtest Time has three subsubtests a/b12, a/b13 and a/b14. We will only analyse a/b12 because the other subsubtests have too many missings.
The scale means show that subsubtest a/b12 is very difficult. The scalability is stronger for version B and the reliability is high for both versions.
The aggregated means show that part a is a bit easier than part b. The two versions do not differ with respect to the means. The scalability is better for version B for both parts.
At scale level, it turned out that there is differential item functioning for gender. The H value and Lambda2 are higher for females than for males in version A. Because the variation in scores is so small, the Hi values are not reliable. So no conclusion can be drawn about the differential item functioning.
The subtest Time turned out to be very difficult. Only subsubtest a/b12 was analysed, but the items of this subsubtest were also too difficult. I would suggest removing this subtest from the test.
The subtest Shape consists of three subsubtests, a/b15, a/b16 and a/b17. Subsubtest a/b17 has too many missing values, so we analysed only subsubtests a/b15 and a/b16.
The first table below shows the scale properties for the two versions of the a/b15 subtest. The scale means show that version A is more difficult than version B. Subsubtest a/b15 forms a strong scale and reliability is just sufficient.
The second table shows the scale properties for the two versions of the a/b16 subsubtest. Note that the sample size dropped by 60%. The scale means show that this subsubtest is significantly more difficult in version B. The H values for version A and B differ: version A has the higher H value, indicating a strong scale, whereas version B has medium scalability. The reliability for the two versions is sufficient.
The aggregated means show that part a is more difficult in version A for subsubtest a/b15. The item properties show that this is caused by item a15a1. Part c is very difficult and should be removed from the test. The scalability of the items is similar for the items within the three parts.
With regard to subsubtest a/b16, we see that both parts are more difficult in version B. The item properties show that items a16a2, a16a3 and a16b1 are probably responsible for this. The scalability of the items of part a is stronger for version A than for version B. Both items a16a1 and a16a2 have higher Hi values in version A.
At scale level, it turned out that there is differential item functioning for gender for subsubtest a/b15. For both versions the scalability is stronger for males than for females. The items show that both items a1 and a2 have higher Hi values for males in both versions. For version B, item b1 also has a higher Hi value for males than for females. The part c items also differ in Hi, but because there is little variation in these items the Hi values are not reliable.
For subsubtest a/b16 there is no differential item functioning at scale level. At item level, we see that item b1 has a much higher Hi value for males in version B.
The items of subsubtest a/b15 of the Shape subtest form a strong scale, but version A is more difficult. This is caused by item a/b15a1. Part c is very difficult and should be removed from the test. There is differential item functioning for gender for this subsubtest. This is probably caused by items a/b15a1, a/b15a2 and a/b15b1.
Subsubtest a/b16 differs in difficulty and scalability for the two versions. The items a16a2, a16a3 and a16b1 seem to be responsible for the difference in difficulty. Items a16a1 and a16a2 may be responsible for the difference in scalability for the two versions. There is no differential item functioning for subsubtest a/b16.
With respect to the difficulty of the math test, a number of subsubtests may be removed from the test:
Recognition: remove part d.
Counting: all right.
Quantity Discrimination: all right.
Number and place value: only a/b3, remove a/b4 and a/b5.
Addition and subtraction: remove a/b8 part c.
Multiplication and Division: only a/b9, remove a/b10 and a/b11.
Time: only a/b12, remove a/b13 and a/b14.
Shape: remove a/b15 part c and a/b17.
With respect to the equality of the two versions, some changes can be made to make the two versions more similar. See the results of the subtests for the details.
With respect to the differential item functioning, the conclusion is that the subtests function quite similarly for females and males. Some small changes can be made. See the results of the subtests for the details.
Questions:
1. Do you agree to create the composite scores for the subtests based on the conclusions above?
2. Do we want separate factor analyses for the two versions?
3. If so, do we first want to adjust some items (or exchange some) to equalize the two versions?
4. If not, which version do we use?