This document presents the methodology and results of the analysis of the student response data for the Reasoning construct. The response data were drawn from a sample of the world, and the Reasoning ability of the world was of research interest. Item Response Theory (IRT) analysis was undertaken, and a one-parameter logistic model (1PLM) was applied to the partial credit data.
Table 1 presents the 13 items included in the Reasoning instrument:
| Item abbreviation | Full item description |
|---|---|
| B1 | I make insightful remarks |
| B2 | I know the answers to many questions |
| B3 | I tend to analyze things |
| B4 | I use my brain |
| B5 | I learn quickly |
| B6 | I counter others’ arguments |
| B7 | I reflect on things before acting |
| B8 | I weigh the pros against the cons |
| B9 | I consider myself an average person |
| B10 | I get confused easily |
| B11 | I know that I am not a special person |
| B12 | I have a poor vocabulary |
| B13 | I skip difficult words while reading |
Analysis was undertaken with the assistance of the \(CTT\) (Willse, 2018), \(psych\) (Revelle, 2018), \(TAM\) (Robitzsch, Kiefer, & Wu, 2018), and \(WrightMap\) (Torres Irribarra & Freund, 2018) R packages. The \(CTT\) package provided the frequencies of the response categories, the point biserial (Pearson) discrimination indices, the poly- and bi-serial discrimination indices, and the Cronbach’s alpha (\(\alpha\)) reliability coefficient. The full correlation matrix provided in Appendix A was created with the \(psych\) package, the \(TAM\) package was used to fit the 1PLM to the partial credit data, and the \(WrightMap\) package provided a visual illustration of the positions of items and persons.
Instrument reliability is theoretically conceived as the proportion of true score variance \(Var[T]\) to observed score variance \(Var[O]\),
\[Reliability = \frac{Var[T]}{Var[O]}\]
Because it is impossible to identify true score variance, it needs to be estimated. Derived from Generalisability Theory (Brennan, 2001), the following formula is often applied to estimate the reliability of a test instrument,
\[ Reliability =\frac{Var[O]-Var[E]}{Var[O]}\]
where \(Var[O]\) represents the amount of observed variation in test ability, and \(Var[E]\) represents the amount of variation due to error. Equivalent to the formula above, the formula for the Cronbach’s alpha coefficient (Cronbach, 1951) is expressed as follows,
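\[ \alpha =\frac{K\bar{c}}{\bar{v}+(K-1)\bar{c}}\]
where \(K\) is the number of items, \(\bar{v}\) is the average item variance, and \(\bar{c}\) is the average of the covariances between all pairs of items. When the items are standardised, \(\bar{v}=1\) and \(\bar{c}\) is the average inter-item correlation.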
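As an illustration of how the coefficient can be computed, the sketch below evaluates alpha from the average item variance and the average inter-item covariance (with standardised items the average variance is 1), and checks the result against the equivalent total-score form \(\frac{K}{K-1}\left(1-\frac{\sum_i Var[X_i]}{Var[X_{total}]}\right)\). It is written in plain Python rather than R, the function names are illustrative, and the three-item data set is hypothetical. It is not the \(CTT\) package's implementation.

```python
from itertools import combinations
from statistics import mean, variance


def covariance(x, y):
    """Sample covariance of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def alpha_from_covariances(items):
    """Alpha from the average item variance and average inter-item covariance."""
    k = len(items)
    v_bar = mean([variance(scores) for scores in items])
    c_bar = mean([covariance(a, b) for a, b in combinations(items, 2)])
    return k * c_bar / (v_bar + (k - 1) * c_bar)


def alpha_from_totals(items):
    """Equivalent form based on the variance of the summed total score."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses: one list per item, one entry per respondent (0-4 scale).
items = [
    [0, 1, 2, 3, 4, 4],
    [1, 1, 2, 2, 3, 4],
    [0, 2, 1, 3, 3, 4],
]
alpha_cov = alpha_from_covariances(items)
alpha_tot = alpha_from_totals(items)
print(round(alpha_cov, 3), round(alpha_tot, 3))
```

The two functions return the same value because the variance of the total score equals \(K\bar{v} + K(K-1)\bar{c}\).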
In accordance with DeVellis (2012), Table 2 presents some general rules-of-thumb for interpreting the meaning of the alpha \(\alpha\) coefficient for a test instrument.
| Internal Consistency | Cronbach’s alpha |
|---|---|
| Excellent | 0.90 & up |
| Good | 0.80-0.89 |
| Acceptable | 0.70-0.79 |
| Questionable | 0.60-0.69 |
| Poor | 0.50-0.59 |
| Unacceptable | under 0.50 |
It should be noted that the Cronbach’s \(\alpha\) coefficient assumes equivalent contribution of each item to the total score (tau, \(\tau\), equivalence); consequently, the \(\alpha\) coefficient is considered a lower-bound statistic. Based on the fitted 1PLM, the expected a posteriori (EAP; Bock & Aitkin, 1981) estimate of instrument reliability is also reported. In addition, the person-separation reliability statistic is used to assess the degree to which the test instrument sufficiently separates students.
The discrimination indices (correlation coefficients) provide an estimate of the contribution of each item to the construct of interest, Reasoning. The index for each item was estimated by omitting the item’s contribution to the summed total (often termed the item-rest correlation). This was done because including the item of interest in the summed total score artificially inflates its discrimination index. Because Pearson correlations provide attenuated estimates of relationships involving categorical variables, polyserial correlations (and biserial correlations for dichotomous items) are also included as a means to assess item discrimination (Olsson, Drasgow, & Dorans, 1982). To assess the degree to which each discrimination index is inflated by the inclusion of the item of interest, a full Pearson item-total correlation matrix is also generated with the assistance of the R \(psych\) package (Appendix A).
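The difference between the item-total and item-rest correlations can be sketched as follows. This is a minimal illustration in plain Python with a hypothetical three-item data set and illustrative function names; it only covers the Pearson (point biserial) index, not the polyserial estimate, and it is not the code used to produce the reported results.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_total_and_rest(items, index):
    """Correlate one item with the total score and with the rest score."""
    totals = [sum(row) for row in zip(*items)]
    # Rest score: total with the item of interest removed.
    rest = [t - s for t, s in zip(totals, items[index])]
    return pearson(items[index], totals), pearson(items[index], rest)


# Hypothetical responses: one list per item, one entry per respondent (0-4 scale).
items = [
    [0, 1, 2, 3, 4, 4],
    [1, 1, 2, 2, 3, 4],
    [0, 2, 1, 3, 3, 4],
]
results = [item_total_and_rest(items, i) for i in range(len(items))]
for i, (r_total, r_rest) in enumerate(results, start=1):
    print(f"item {i}: item-total = {r_total:.2f}, item-rest = {r_rest:.2f}")
```

For positively related items the item-total value is always at least as large as the item-rest value, which is the inflation described above.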
IRT analysis was undertaken with the assistance of the \(TAM\) package. Because the Reasoning ability of the world was of interest, marginal maximum likelihood (MML) estimation was used. This form of estimation relies on the assumption that the construct of interest is normally distributed throughout the population (the model incorporates this assumption as a Bayesian prior distribution). Parameters pertaining directly to the population of interest are defined as expected a posteriori (EAP) statistics. For this level of analysis, an estimate of the variance in student ability, EAP reliability estimates, item fit statistics (infit and outfit), item expected score curves, item category characteristic curves, and Thurstonian thresholds are produced. To interpret item outfit, the confidence interval for the chi-square distribution is also included. The 95% confidence intervals for the outfit chi-square statistics were estimated in accordance with Wu and Adams (2013, p. 344),
\[95\%\ \text{confidence interval} = 1 \pm 2\sqrt{\frac{2}{N}}\]
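As a worked example, taking \(N\) to be the number of respondents and using the 3453 respondents reported in the results, the interval evaluates to approximately

\[1 \pm 2\sqrt{\frac{2}{3453}} \approx 1 \pm 0.048, \quad \text{that is, from } 0.952 \text{ to } 1.048.\]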
Because EAP statistics pertain to the broader population, they are not directly available for each test respondent. If person-level data are of particular research interest, plausible values (PVs) can be derived via the \(TAM\) R package. Plausible values are random draws from each respondent's posterior ability distribution under the estimated population model, and the sampled data can be used in subsequent statistical modeling. For example, if the data were to be used in a multi-level model (MLM), at least five sets of plausible values should be drawn (Wu, 2005; Laukaityte & Wiberg, 2017). The MLM should then be run once per set, and the average across runs of each intercept, fixed effect, and variance component should be interpreted.
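The averaging step can be sketched as below. The point estimate is the simple average described above; pooling the standard errors with Rubin's combination rules (within-set variance plus \(1 + 1/M\) times the between-set variance) is an assumption added here for illustration and is not stated in this report. The function name and the five estimates are hypothetical.

```python
from statistics import mean, variance


def pool_plausible_value_results(estimates, standard_errors):
    """Combine one parameter's results across M plausible-value analyses."""
    m = len(estimates)
    pooled_estimate = mean(estimates)                 # average across the M runs
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                     # variance between the runs
    total_variance = within + (1 + 1 / m) * between   # Rubin's rules (assumed here)
    return pooled_estimate, total_variance ** 0.5


# Hypothetical fixed-effect estimates from five MLMs, one per plausible-value set.
estimates = [0.42, 0.45, 0.40, 0.44, 0.43]
standard_errors = [0.05, 0.05, 0.05, 0.05, 0.05]
pooled, pooled_se = pool_plausible_value_results(estimates, standard_errors)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is never smaller than the average within-set standard error, because the between-set term reflects the extra uncertainty in the ability draws.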
Finally, with the assistance of the R \(WrightMap\) package, a Wright Map is produced which maps the Reasoning ability of the population against the item steps for each of the 13 items. All results are now presented.
A total of 3453 respondents participated in the Reasoning exam. The mean overall total score was 35.14 and the standard deviation of the total score was 6.76. The frequency of categories for the partial credit items is given in Table 3.
| Category | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 | B10 | B11 | B12 | B13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | 87 | 22 | 14 | 35 | 93 | 128 | 56 | 321 | 166 | 301 | 74 | 171 |
| 1 | 292 | 506 | 129 | 47 | 204 | 474 | 524 | 287 | 1257 | 791 | 819 | 247 | 478 |
| 2 | 825 | 869 | 294 | 209 | 475 | 873 | 704 | 630 | 511 | 640 | 669 | 410 | 369 |
| 3 | 1670 | 1435 | 1696 | 1519 | 1703 | 1500 | 1551 | 1661 | 947 | 1390 | 1031 | 1364 | 1249 |
| 4 | 562 | 550 | 1306 | 1657 | 1029 | 495 | 534 | 809 | 411 | 459 | 625 | 1351 | 1179 |
The 13 point biserial and polyserial discrimination indices are included in Table 4.
| Item Abbreviation | Point Biserial | Bi- or Poly-serial |
|---|---|---|
| B1 | 0.45 | 0.50 |
| B2 | 0.46 | 0.49 |
| B3 | 0.42 | 0.48 |
| B4 | 0.53 | 0.61 |
| B5 | 0.50 | 0.55 |
| B6 | 0.31 | 0.34 |
| B7 | 0.30 | 0.33 |
| B8 | 0.33 | 0.37 |
| B9 | 0.37 | 0.38 |
| B10 | 0.43 | 0.46 |
| B11 | 0.28 | 0.30 |
| B12 | 0.50 | 0.55 |
| B13 | 0.35 | 0.39 |
The item fit indices (unweighted and weighted mean squares) for the 13 Reasoning items are provided in Table 5. Item infit and outfit t statistics and p values are also included.
| Item Abbreviation | Fit Group | Outfit | Outfit t | Outfit p | Infit | Infit t | Infit p |
|---|---|---|---|---|---|---|---|
| B1 | 1 | 0.99 | -0.48 | 0.63 | 0.95 | -2.07 | 0.04 |
| B2 | 2 | 0.96 | -1.72 | 0.09 | 0.95 | -2.25 | 0.02 |
| B3 | 3 | 0.94 | -1.98 | 0.05 | 0.95 | -1.82 | 0.07 |
| B4 | 4 | 0.82 | -6.19 | 0.00 | 0.85 | -4.74 | 0.00 |
| B5 | 5 | 0.89 | -4.11 | 0.00 | 0.89 | -4.01 | 0.00 |
| B6 | 6 | 1.10 | 4.23 | 0.00 | 1.08 | 3.42 | 0.00 |
| B7 | 7 | 1.15 | 6.15 | 0.00 | 1.11 | 4.67 | 0.00 |
| B8 | 8 | 1.07 | 2.76 | 0.01 | 1.05 | 1.76 | 0.08 |
| B9 | 9 | 1.10 | 4.55 | 0.00 | 1.08 | 3.95 | 0.00 |
| B10 | 10 | 1.03 | 1.52 | 0.13 | 1.00 | 0.22 | 0.82 |
| B11 | 11 | 1.25 | 10.64 | 0.00 | 1.20 | 9.13 | 0.00 |
| B12 | 12 | 0.90 | -3.49 | 0.00 | 0.91 | -3.34 | 0.00 |
| B13 | 13 | 1.17 | 5.87 | 0.00 | 1.10 | 4.14 | 0.00 |
The item outfit statistics are also graphed in Figure 1. The graph also includes the upper- and lower-bound 95% confidence intervals for the chi-square \((\chi^2)\) distribution (horizontal red lines).
Figure 1. Unweighted Fit Statistics
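The comparison shown in Figure 1 can be reproduced numerically. The sketch below, in plain Python with illustrative variable names, recomputes the interval bounds for the 3453 respondents and lists which outfit values from Table 5 fall inside them. It is a reading aid under those assumptions, not the code used to draw the figure.

```python
import math

n_respondents = 3453
half_width = 2 * math.sqrt(2 / n_respondents)
lower, upper = 1 - half_width, 1 + half_width

# Outfit mean squares copied from Table 5.
outfit = {
    "B1": 0.99, "B2": 0.96, "B3": 0.94, "B4": 0.82, "B5": 0.89,
    "B6": 1.10, "B7": 1.15, "B8": 1.07, "B9": 1.10, "B10": 1.03,
    "B11": 1.25, "B12": 0.90, "B13": 1.17,
}

within = [item for item, value in outfit.items() if lower <= value <= upper]
below = [item for item, value in outfit.items() if value < lower]
above = [item for item, value in outfit.items() if value > upper]

print(f"bounds: {lower:.3f} to {upper:.3f}")
print("within bounds:", within)
print("below lower bound:", below)
print("above upper bound:", above)
```

Values below the lower bound indicate responses that are more predictable than the model expects, and values above the upper bound indicate responses that are noisier than expected.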
Based on the MML analysis, the mean ability (logit) of the population of interest was 0 (the mean ability is fixed to zero by default). The Reasoning logits ranged from -1.85 to 1.72 and the standard deviation was 0.6.
The item difficulties had a mean of -0.74 and a standard deviation of 0.44. The difficulty logits for each respective item are provided in Table 6 (the full set of item difficulties for each polytomous category is provided in Appendix A2).
| Item Abbreviation | Item Difficulties |
|---|---|
| B1 | -0.64 |
| B2 | -0.60 |
| B3 | -1.35 |
| B4 | -1.58 |
| B5 | -1.12 |
| B6 | -0.56 |
| B7 | -0.50 |
| B8 | -0.89 |
| B9 | -0.06 |
| B10 | -0.34 |
| B11 | -0.25 |
| B12 | -1.00 |
| B13 | -0.69 |
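The values in Table 6 appear to equal the mean of each item's category parameters (xsi) listed in Appendix A2, and the summary statistics quoted above match the mean and sample standard deviation of the 13 difficulties. The sketch below checks both observations in plain Python for three of the items. That the difficulties were derived this way is an inference from the tables, not a statement about how \(TAM\) computes them.

```python
from statistics import mean, stdev

# Category parameters (xsi) for three items, copied from Appendix A2.
steps = {
    "B1": [-1.73, -1.36, -0.77, 1.32],
    "B4": [-1.98, -2.01, -2.25, -0.06],
    "B9": [-1.73, 0.77, -0.50, 1.24],
}

# Item difficulties copied from Table 6, in item order B1 to B13.
difficulties = {
    "B1": -0.64, "B2": -0.60, "B3": -1.35, "B4": -1.58, "B5": -1.12,
    "B6": -0.56, "B7": -0.50, "B8": -0.89, "B9": -0.06, "B10": -0.34,
    "B11": -0.25, "B12": -1.00, "B13": -0.69,
}

# Compare the mean of the category parameters with the reported difficulty.
step_means = {item: mean(values) for item, values in steps.items()}
for item, value in step_means.items():
    print(f"{item}: mean xsi = {value:.3f}, reported = {difficulties[item]:.2f}")

overall_mean = mean(difficulties.values())
overall_sd = stdev(difficulties.values())  # sample SD (n - 1 denominator)
print(f"mean = {overall_mean:.2f}, SD = {overall_sd:.2f}")
```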
The Cronbach’s alpha was 0.77 and the population-related reliability estimated under MML (the EAP reliability) was 0.78 (Wu, Tam, & Jen, 2016). In addition, the person separation reliability estimate was 0.79.
The Wright Map is presented in Figure 2:
Figure 2. Wright Map for Reasoning Construct
The following set of graphs illustrates the expected score curves for all items.
The following set of graphs illustrates the category characteristic curves for all items. The full set of item parameters and standard errors for each item category can be found in Appendix A2.
The Thurstonian thresholds are presented in Table 7.
| Item | Cat1 | Cat2 | Cat3 | Cat4 |
|---|---|---|---|---|
| B1 | -2.175751 | -1.3294373 | -0.4919128 | 1.4343567 |
| B2 | -2.476593 | -1.0641174 | -0.2469177 | 1.3794250 |
| B3 | -2.761505 | -1.7571716 | -1.2988586 | 0.4419250 |
| B4 | -2.637176 | -2.0988464 | -1.6495056 | 0.0482483 |
| B5 | -2.678925 | -1.5724182 | -0.9652405 | 0.7561340 |
| B6 | -2.381195 | -1.0930481 | -0.2672424 | 1.5083313 |
| B7 | -2.152313 | -0.9399719 | -0.3304138 | 1.4442444 |
| B8 | -2.506073 | -1.3845520 | -0.6927795 | 1.0302429 |
| B9 | -1.820526 | -0.1070251 | 0.2503967 | 1.4391174 |
| B10 | -2.157806 | -0.6253967 | -0.1255188 | 1.5379944 |
| B11 | -1.647125 | -0.4615173 | 0.0244446 | 1.0886536 |
| B12 | -2.181976 | -1.3003235 | -0.8584900 | 0.3316956 |
| B13 | -1.825836 | -0.8394470 | -0.5472107 | 0.4660950 |
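A Thurstonian threshold for category \(k\) is the ability at which the probability of scoring \(k\) or higher equals 0.5. The sketch below illustrates this for item B1, assuming the xsi values in Appendix A2 are partial credit step parameters. It is written in plain Python with illustrative function names and a simple bisection search; it is not \(TAM\)'s implementation, and small differences from the table are expected because the xsi values are rounded to two decimals.

```python
import math


def pcm_probabilities(theta, steps):
    """Probabilities of scores 0..m under the partial credit model."""
    exponents = [0.0]
    for delta in steps:
        # Cumulative sum of (theta - step) up to each score category.
        exponents.append(exponents[-1] + (theta - delta))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def thurstonian_threshold(steps, category, lo=-10.0, hi=10.0, tol=1e-6):
    """Ability at which P(score >= category) = 0.5, found by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        p_at_or_above = sum(pcm_probabilities(mid, steps)[category:])
        if p_at_or_above < 0.5:
            lo = mid  # threshold lies at a higher ability
        else:
            hi = mid
    return (lo + hi) / 2


b1_steps = [-1.73, -1.36, -0.77, 1.32]  # xsi for B1, copied from Appendix A2
b1_thresholds = [thurstonian_threshold(b1_steps, k) for k in range(1, 5)]
print([round(t, 2) for t in b1_thresholds])
```

Unlike the step parameters, the thresholds computed this way are always ordered from the lowest to the highest category, which is why they are convenient for plotting item steps on a Wright Map.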
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459.
Brennan, R. L. (2001). Generalizability Theory. New York: Springer-Verlag.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334. doi:10.1007/BF02310555
Laukaityte, I., & Wiberg, M. (2017). Using plausible values in secondary analysis in large-scale assessments. Communications in Statistics - Theory and Methods, 46(22), 11341-11357. doi:10.1080/03610926.2016.1267764
Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47(3), 337–347.
Revelle, W. (2018). psych: Procedures for Personality and Psychological Research, Version = 1.8.4. Northwestern University, Evanston, Illinois, USA. Retrieved from https://CRAN.R-project.org/package=psych
Robitzsch, A., Kiefer, T., & Wu, M. (2018). TAM: Test analysis modules. R package version 2.10-24. Retrieved from https://CRAN.R-project.org/package=TAM
Torres Irribarra, D., & Freund, R. (2014). WrightMap: IRT item-person map, Version = 1.2.1. Retrieved from http://github.com/david-ti/wrightmap
Willse, J. T. (2018). CTT: Classical Test Theory Functions. R package version 2.3.2. Retrieved from https://CRAN.R-project.org/package=CTT
Wu, M. (2005). The role of plausible values in large-scale surveys. Studies in Educational Evaluation, 31(2-3), 114-128. doi:10.1016/j.stueduc.2005.05.005
Wu, M., & Adams, R. J. (2013). Properties of Rasch residual fit statistics. Journal of Applied Measurement, 14(4), 339-355.
Wu, M., Tam, H. P., & Jen, T-H. (2016). Educational Measurement for Applied Researchers: Theory into Practice. Singapore: Springer Nature.
Using the psych package, the Pearson correlation matrix is as follows:
| Item | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 | B10 | B11 | B12 | B13 | Row_Totals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B1 | 1.00 | 0.33 | 0.27 | 0.29 | 0.24 | 0.29 | 0.18 | 0.20 | 0.22 | 0.19 | 0.18 | 0.28 | 0.18 | 0.56 |
| B2 | 0.33 | 1.00 | 0.24 | 0.33 | 0.38 | 0.28 | 0.10 | 0.15 | 0.25 | 0.21 | 0.14 | 0.32 | 0.21 | 0.57 |
| B3 | 0.27 | 0.24 | 1.00 | 0.38 | 0.27 | 0.23 | 0.29 | 0.30 | 0.17 | 0.17 | 0.10 | 0.19 | 0.14 | 0.52 |
| B4 | 0.29 | 0.33 | 0.38 | 1.00 | 0.41 | 0.19 | 0.27 | 0.27 | 0.18 | 0.34 | 0.16 | 0.32 | 0.24 | 0.60 |
| B5 | 0.24 | 0.38 | 0.27 | 0.41 | 1.00 | 0.20 | 0.18 | 0.19 | 0.20 | 0.39 | 0.16 | 0.32 | 0.24 | 0.59 |
| B6 | 0.29 | 0.28 | 0.23 | 0.19 | 0.20 | 1.00 | 0.03 | 0.14 | 0.20 | 0.12 | 0.05 | 0.18 | 0.13 | 0.44 |
| B7 | 0.18 | 0.10 | 0.29 | 0.27 | 0.18 | 0.03 | 1.00 | 0.34 | 0.09 | 0.19 | 0.10 | 0.15 | 0.12 | 0.44 |
| B8 | 0.20 | 0.15 | 0.30 | 0.27 | 0.19 | 0.14 | 0.34 | 1.00 | 0.07 | 0.17 | 0.10 | 0.17 | 0.14 | 0.45 |
| B9 | 0.22 | 0.25 | 0.17 | 0.18 | 0.20 | 0.20 | 0.09 | 0.07 | 1.00 | 0.18 | 0.39 | 0.22 | 0.14 | 0.52 |
| B10 | 0.19 | 0.21 | 0.17 | 0.34 | 0.39 | 0.12 | 0.19 | 0.17 | 0.18 | 1.00 | 0.20 | 0.30 | 0.26 | 0.56 |
| B11 | 0.18 | 0.14 | 0.10 | 0.16 | 0.16 | 0.05 | 0.10 | 0.10 | 0.39 | 0.20 | 1.00 | 0.15 | 0.04 | 0.45 |
| B12 | 0.28 | 0.32 | 0.19 | 0.32 | 0.32 | 0.18 | 0.15 | 0.17 | 0.22 | 0.30 | 0.15 | 1.00 | 0.47 | 0.60 |
| B13 | 0.18 | 0.21 | 0.14 | 0.24 | 0.24 | 0.13 | 0.12 | 0.14 | 0.14 | 0.26 | 0.04 | 0.47 | 1.00 | 0.50 |
| Row_Totals | 0.56 | 0.57 | 0.52 | 0.60 | 0.59 | 0.44 | 0.44 | 0.45 | 0.52 | 0.56 | 0.45 | 0.60 | 0.50 | 1.00 |
Note. The bottom row and the last column of the correlation matrix contain the Pearson correlation coefficients between each respective item and the item totals.
| Parameter | xsi | se.xsi |
|---|---|---|
| B1_Cat1 | -1.73 | 0.11 |
| B1_Cat2 | -1.36 | 0.06 |
| B1_Cat3 | -0.77 | 0.04 |
| B1_Cat4 | 1.32 | 0.05 |
| B2_Cat1 | -2.28 | 0.11 |
| B2_Cat2 | -0.83 | 0.05 |
| B2_Cat3 | -0.53 | 0.04 |
| B2_Cat4 | 1.23 | 0.05 |
| B3_Cat1 | -2.47 | 0.22 |
| B3_Cat2 | -1.29 | 0.09 |
| B3_Cat3 | -1.96 | 0.05 |
| B3_Cat4 | 0.34 | 0.04 |
| B4_Cat1 | -1.98 | 0.27 |
| B4_Cat2 | -2.01 | 0.13 |
| B4_Cat3 | -2.25 | 0.07 |
| B4_Cat4 | -0.06 | 0.04 |
| B5_Cat1 | -2.41 | 0.17 |
| B5_Cat2 | -1.26 | 0.07 |
| B5_Cat3 | -1.44 | 0.04 |
| B5_Cat4 | 0.64 | 0.04 |
| B6_Cat1 | -2.15 | 0.11 |
| B6_Cat2 | -0.90 | 0.05 |
| B6_Cat3 | -0.57 | 0.04 |
| B6_Cat4 | 1.38 | 0.05 |
| B7_Cat1 | -1.92 | 0.09 |
| B7_Cat2 | -0.58 | 0.05 |
| B7_Cat3 | -0.82 | 0.04 |
| B7_Cat4 | 1.33 | 0.05 |
| B8_Cat1 | -2.23 | 0.14 |
| B8_Cat2 | -1.15 | 0.06 |
| B8_Cat3 | -1.08 | 0.04 |
| B8_Cat4 | 0.90 | 0.04 |
| B9_Cat1 | -1.73 | 0.06 |
| B9_Cat2 | 0.77 | 0.04 |
| B9_Cat3 | -0.50 | 0.04 |
| B9_Cat4 | 1.24 | 0.06 |
| B10_Cat1 | -2.02 | 0.08 |
| B10_Cat2 | -0.02 | 0.04 |
| B10_Cat3 | -0.76 | 0.04 |
| B10_Cat4 | 1.42 | 0.05 |
| B11_Cat1 | -1.42 | 0.06 |
| B11_Cat2 | 0.00 | 0.04 |
| B11_Cat3 | -0.39 | 0.04 |
| B11_Cat4 | 0.82 | 0.05 |
| B12_Cat1 | -1.83 | 0.12 |
| B12_Cat2 | -0.91 | 0.06 |
| B12_Cat3 | -1.37 | 0.05 |
| B12_Cat4 | 0.12 | 0.04 |
| B13_Cat1 | -1.57 | 0.08 |
| B13_Cat2 | -0.07 | 0.05 |
| B13_Cat3 | -1.32 | 0.04 |
| B13_Cat4 | 0.22 | 0.04 |