1. Purpose

The purpose of this document is to present the methodology and results of the analysis of the student response data for the Reasoning construct. The response data were drawn from a sample of the world, and the Reasoning ability of the world was of research interest. Item Response Theory (IRT) analysis was undertaken, and a one-parameter logistic model (1PLM) was applied to the partial credit data.

2. Items Included in the Reasoning Instrument

Table 1 presents the 13 items included in the Reasoning instrument:

Table 1. List of Items
Item abbreviation Full item description
B1 I make insightful remarks
B2 I know the answers to many questions
B3 I tend to analyze things
B4 I use my brain
B5 I learn quickly
B6 I counter others’ arguments
B7 I reflect on things before acting
B8 I weigh the pros against the cons
B9 I consider myself an average person
B10 I get confused easily
B11 I know that I am not a special person
B12 I have a poor vocabulary
B13 I skip difficult words while reading

3. Methodology

3.1 Statistical packages

Analysis was undertaken with the assistance of the \(CTT\) (Willse, 2018), \(psych\) (Revelle, 2018), \(TAM\) (Robitzsch, Kiefer, & Wu, 2018), and \(WrightMap\) (Torres Irribarra & Freund, 2018) R packages. The \(CTT\) package provided the frequencies of response categories, the point biserial (Pearson) discrimination indices, the polyserial and biserial discrimination indices, and the Cronbach’s alpha (\(\alpha\)) reliability coefficient. The full correlation matrix provided in Appendix A1 was created with the \(psych\) package, the \(TAM\) package was used to fit the 1PLM to the partial credit data, and the \(WrightMap\) package provided a visual illustration of the positions of items and persons.

3.2 Assessment of reliability

Instrument reliability is theoretically conceived as the proportion of true score variance \(Var[T]\) to observed score variance \(Var[O]\),

\[Reliability = \frac{Var[T]}{Var[O]}\]

Because it is impossible to identify true score variance, it needs to be estimated. Derived from Generalisability Theory (Brennan, 2001), the following formula is often applied to estimate the reliability of a test instrument,

\[ Reliability =\frac{Var[O]-Var[E]}{Var[O]}\]

where \(Var[O]\) represents the amount of observed variation in test ability, and \(Var[E]\) represents the amount of variation due to error. Equivalent to the formula above, the formula for the Cronbach’s alpha coefficient (Cronbach, 1951) is expressed as follows,

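\[ \alpha =\frac{K\bar{c}}{\bar{v}+(K-1)\bar{c}}\]

where \(K\) is the number of items, \(\bar{v}\) is the average item variance, and \(\bar{c}\) is the average of the covariances between all pairs of distinct items. When the items are standardised, \(\bar{v}=1\) and \(\bar{c}\) is the average inter-item correlation.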
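As a rough cross-check, a standardised version of the coefficient can be computed from the inter-item correlations listed in Appendix A1. The sketch below is illustrative only: it assumes every item has unit variance (so that covariances become correlations), and the function name `standardised_alpha` is hypothetical rather than part of any package used in this report.

```python
# Upper triangle of the Appendix A1 Pearson correlation matrix (B1..B13):
# each row lists an item's correlations with the items that follow it.
UPPER = [
    [0.33, 0.27, 0.29, 0.24, 0.29, 0.18, 0.20, 0.22, 0.19, 0.18, 0.28, 0.18],  # B1
    [0.24, 0.33, 0.38, 0.28, 0.10, 0.15, 0.25, 0.21, 0.14, 0.32, 0.21],        # B2
    [0.38, 0.27, 0.23, 0.29, 0.30, 0.17, 0.17, 0.10, 0.19, 0.14],              # B3
    [0.41, 0.19, 0.27, 0.27, 0.18, 0.34, 0.16, 0.32, 0.24],                    # B4
    [0.20, 0.18, 0.19, 0.20, 0.39, 0.16, 0.32, 0.24],                          # B5
    [0.03, 0.14, 0.20, 0.12, 0.05, 0.18, 0.13],                                # B6
    [0.34, 0.09, 0.19, 0.10, 0.15, 0.12],                                      # B7
    [0.07, 0.17, 0.10, 0.17, 0.14],                                            # B8
    [0.18, 0.39, 0.22, 0.14],                                                  # B9
    [0.20, 0.30, 0.26],                                                        # B10
    [0.15, 0.04],                                                              # B11
    [0.47],                                                                    # B12
]


def standardised_alpha(n_items, mean_r):
    """Alpha for unit-variance items with average inter-item correlation mean_r."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


pairs = [r for row in UPPER for r in row]   # all distinct item pairs
n_items = len(UPPER) + 1                    # 13 items
mean_r = sum(pairs) / len(pairs)
alpha_std = standardised_alpha(n_items, mean_r)
print(f"mean r = {mean_r:.3f}, standardised alpha = {alpha_std:.2f}")
```

Because the correlations are rounded to two decimals and the items are not actually standardised, the value (about 0.78) should be read as an approximation to, not a replacement for, the alpha reported in Section 4.4.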

In accordance with DeVellis (2012), Table 2 presents some general rules of thumb for interpreting the \(\alpha\) coefficient for a test instrument.

Table 2. Interpreting Cronbach’s Alpha
Internal Consistency Cronbach’s alpha
Excellent 0.90 & up
Good 0.80-0.89
Acceptable 0.70-0.79
Questionable 0.60-0.69
Poor 0.50-0.59
Unacceptable under 0.50

It should be noted that Cronbach’s \(\alpha\) coefficient assumes that each item contributes equally to the total score (tau, \(\tau\), equivalence); when this assumption does not hold, \(\alpha\) is a lower-bound estimate of reliability. Based on the 1PLM, the expected a posteriori (EAP; Bock & Aitkin, 1981) estimate of instrument reliability is also reported. In addition, the person-separation reliability statistic is used to assess the degree to which the test instrument separates students.

3.3 Assessment of item discrimination

The discrimination indices (correlation coefficients) provide an estimate of the contribution of each item to the construct of interest, Reasoning. The indices for each item were estimated by omitting the item’s contribution to the summed total (often termed item-rest correlations). This was done because including the item of interest in the summed total score artificially inflates its discrimination index. Because Pearson correlations provide attenuated estimates of relationships involving categorical variables, polyserial correlations (and biserial correlations for dichotomous items) are also included as a means of assessing item discrimination (Olsson, Drasgow, & Dorans, 1982). To assess the degree to which each discrimination index is inflated by the inclusion of the item of interest, a full Pearson item-total correlation matrix is also generated with the assistance of the R \(psych\) package.
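The difference between item-total and item-rest correlations can be illustrated with a small sketch. The response matrix `TOY` is invented purely for illustration (it is not the study data), and the helper names are hypothetical; the report's own indices were produced with the \(CTT\) package.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def discrimination(responses, item):
    """Return (item-total, item-rest) correlations for one item."""
    x = [row[item] for row in responses]
    total = [sum(row) for row in responses]
    rest = [t - xi for t, xi in zip(total, x)]  # total with the item removed
    return pearson(x, total), pearson(x, rest)


# Hypothetical responses: 6 respondents by 3 items scored 0-4.
TOY = [
    [0, 1, 0],
    [1, 1, 2],
    [2, 2, 1],
    [3, 2, 3],
    [4, 3, 3],
    [4, 4, 4],
]

results = [discrimination(TOY, j) for j in range(3)]
for j, (r_total, r_rest) in enumerate(results):
    print(f"item {j}: item-total = {r_total:.2f}, item-rest = {r_rest:.2f}")
```

For each toy item the item-total value exceeds the item-rest value, which is the inflation that the item-rest approach is intended to remove.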

3.4 Item-response theory analysis

IRT analysis was undertaken with the assistance of the \(TAM\) package. Because the Reasoning ability of the world was of interest, marginal maximum likelihood (MML) estimation was used. This form of estimation relies on the assumption that the construct of interest is normally distributed throughout the population (the model incorporates this assumption as a Bayesian prior distribution). Parameters pertaining directly to the population of interest are defined as expected a posteriori (EAP) statistics. For this level of analysis, an estimate of the variance in student ability, EAP reliability estimates, item fit statistics (infit and outfit), item expected score curves, item category characteristic curves, and Thurstonian thresholds are produced. To aid interpretation of item outfit, the 95% confidence interval for the outfit mean square, which approximately follows a scaled chi-square distribution with an expected value of 1, is also included. These intervals were estimated in accordance with Wu and Adams (2013, p. 344), where \(N\) is the number of respondents,

\[95\%\ \text{confidence interval} = 1\pm2\sqrt{\frac{2}{N}}\]
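As a worked illustration of this formula, the sketch below evaluates the interval for the sample size reported in Section 4.1 (3453 respondents). The function name `outfit_bounds` is hypothetical and is used here only for illustration.

```python
from math import sqrt


def outfit_bounds(n_respondents):
    """Approximate 95% interval for the outfit mean square: 1 +/- 2*sqrt(2/N)."""
    half_width = 2 * sqrt(2 / n_respondents)
    return 1 - half_width, 1 + half_width


lower, upper = outfit_bounds(3453)
print(f"lower = {lower:.3f}, upper = {upper:.3f}")
```

With a sample of this size the interval is narrow (roughly 0.95 to 1.05), so even modest departures from 1 fall outside it.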

Because EAP statistics pertain to the broader population, they are not directly available for each test respondent. If person-level data are of particular research interest, plausible values (PVs) can be derived via the \(TAM\) R package. Plausible values are random draws from each respondent’s posterior ability distribution, given the observed responses and the estimated population model. These sampled data can be used in subsequent statistical modeling. For example, if the data were to be used in a multi-level model (MLM), at least five sets of plausible values should be drawn (Wu, 2005; Laukaityte & Wiberg, 2017). Five respective MLMs should then be run, and the average result for each intercept, fixed effect, and variance component should be interpreted.
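The averaging step can be sketched as follows. The five estimates and their sampling variances are invented numbers standing in for one fixed effect from five separate MLMs. The pooled variance uses the combining rules commonly applied to multiply imputed data (within-set variance plus inflated between-set variance), which is an assumption added here rather than a step described in this report.

```python
from statistics import mean, variance


def pool_plausible_values(estimates, sampling_variances):
    """Pool one parameter across analyses run on separate plausible-value sets."""
    m = len(estimates)
    pooled = mean(estimates)              # average of the per-set estimates
    within = mean(sampling_variances)     # average squared standard error
    between = variance(estimates)         # variance across sets (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results for one fixed effect from five plausible-value sets.
estimates = [0.42, 0.45, 0.40, 0.44, 0.43]
sampling_variances = [0.0025, 0.0024, 0.0026, 0.0025, 0.0025]

pooled, total_var = pool_plausible_values(estimates, sampling_variances)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {total_var ** 0.5:.3f}")
```

The between-set term reflects the measurement uncertainty in ability that a single set of plausible values would hide.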

Finally, with the assistance of the R \(WrightMap\) package, a Wright Map is produced which maps the Reasoning ability of the population against the item steps for each of the 13 items. All results are now presented.

4. Results

4.1 Observed scores

A total of 3453 respondents participated in the Reasoning exam. The mean total score was 35.14 and the standard deviation of the total score was 6.76. The frequency of categories for the partial credit items is given in Table 3.

Table 3. Frequency of Categories for Each Item
Category B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13
0 90 87 22 14 35 93 128 56 321 166 301 74 171
1 292 506 129 47 204 474 524 287 1257 791 819 247 478
2 825 869 294 209 475 873 704 630 511 640 669 410 369
3 1670 1435 1696 1519 1703 1500 1551 1661 947 1390 1031 1364 1249
4 562 550 1306 1657 1029 495 534 809 411 459 625 1351 1179
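The category frequencies in Table 3 can be turned into item means, as in the sketch below. It assumes that the row labels 0 to 4 are the item scores. Because the number of valid responses differs slightly between items, the sum of the item means only approximates the mean total score reported above.

```python
ITEMS = ["B1", "B2", "B3", "B4", "B5", "B6", "B7",
         "B8", "B9", "B10", "B11", "B12", "B13"]

# Rows are score categories 0..4; columns follow ITEMS (copied from Table 3).
COUNTS = [
    [90, 87, 22, 14, 35, 93, 128, 56, 321, 166, 301, 74, 171],
    [292, 506, 129, 47, 204, 474, 524, 287, 1257, 791, 819, 247, 478],
    [825, 869, 294, 209, 475, 873, 704, 630, 511, 640, 669, 410, 369],
    [1670, 1435, 1696, 1519, 1703, 1500, 1551, 1661, 947, 1390, 1031, 1364, 1249],
    [562, 550, 1306, 1657, 1029, 495, 534, 809, 411, 459, 625, 1351, 1179],
]

item_n = {}
item_mean = {}
for name, column in zip(ITEMS, zip(*COUNTS)):
    n = sum(column)                      # valid responses for this item
    item_n[name] = n
    item_mean[name] = sum(score * count for score, count in enumerate(column)) / n

sum_of_means = sum(item_mean.values())
for name in ITEMS:
    print(f"{name}: n = {item_n[name]}, mean = {item_mean[name]:.2f}")
print(f"sum of item means = {sum_of_means:.2f}")
```

The sum of the item means comes out very close to the reported mean total score of 35.14, which suggests the table and the summary statistics are consistent.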

4.2 Point biserial discrimination indices

The point biserial and polyserial discrimination indices for the 13 items are included in Table 4.

Table 4. Point Biserial, and Bi- and Poly-serial Discrimination Indices
Item Abbreviation Point Biserial Bi- or Poly-serial
B1 0.45 0.50
B2 0.46 0.49
B3 0.42 0.48
B4 0.53 0.61
B5 0.50 0.55
B6 0.31 0.34
B7 0.30 0.33
B8 0.33 0.37
B9 0.37 0.38
B10 0.43 0.46
B11 0.28 0.30
B12 0.50 0.55
B13 0.35 0.39

4.3 Item fit for the Reasoning items

The item fit indices (unweighted and weighted mean squares) for the 13 Reasoning items are provided in Table 5. Item infit and outfit t statistics and p values are also included.

Table 5. Item Fit Statistics
Item Abbreviation Fit Group Outfit Outfit t Outfit p Infit Infit t Infit p
B1 1 0.99 -0.48 0.63 0.95 -2.07 0.04
B2 2 0.96 -1.72 0.09 0.95 -2.25 0.02
B3 3 0.94 -1.98 0.05 0.95 -1.82 0.07
B4 4 0.82 -6.19 0.00 0.85 -4.74 0.00
B5 5 0.89 -4.11 0.00 0.89 -4.01 0.00
B6 6 1.10 4.23 0.00 1.08 3.42 0.00
B7 7 1.15 6.15 0.00 1.11 4.67 0.00
B8 8 1.07 2.76 0.01 1.05 1.76 0.08
B9 9 1.10 4.55 0.00 1.08 3.95 0.00
B10 10 1.03 1.52 0.13 1.00 0.22 0.82
B11 11 1.25 10.64 0.00 1.20 9.13 0.00
B12 12 0.90 -3.49 0.00 0.91 -3.34 0.00
B13 13 1.17 5.87 0.00 1.10 4.14 0.00

The item outfit statistics are also graphed in Figure 1. The graph also includes the upper- and lower-bound 95% confidence intervals for the chi-square \((\chi^2)\) distribution (horizontal red lines).

Figure 1. Unweighted Fit Statistics
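In the absence of the plot, the comparison it conveys can be sketched numerically. The code below applies the interval 1 ± 2√(2/N) from Section 3.4 to the outfit values in Table 5. The labels "overfit" (below the lower bound) and "underfit" (above the upper bound) follow common Rasch usage and are not terms used elsewhere in this report.

```python
from math import sqrt

# Outfit mean squares copied from Table 5.
OUTFIT = {
    "B1": 0.99, "B2": 0.96, "B3": 0.94, "B4": 0.82, "B5": 0.89,
    "B6": 1.10, "B7": 1.15, "B8": 1.07, "B9": 1.10, "B10": 1.03,
    "B11": 1.25, "B12": 0.90, "B13": 1.17,
}
N_RESPONDENTS = 3453

half_width = 2 * sqrt(2 / N_RESPONDENTS)
lower, upper = 1 - half_width, 1 + half_width

overfit = [item for item, value in OUTFIT.items() if value < lower]
underfit = [item for item, value in OUTFIT.items() if value > upper]
within = [item for item, value in OUTFIT.items() if lower <= value <= upper]

print("within interval:", within)
print("below lower bound (overfit):", overfit)
print("above upper bound (underfit):", underfit)
```

Only B1, B2, and B10 fall inside the interval. With a sample this large the interval is narrow, so statistical misfit should be weighed against the practical size of each departure from 1.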

4.4 Results for IRT student ability and item difficulty estimates

Based on the MML analysis, the mean ability (logit) of the population of interest was 0 (the mean ability is fixed to zero by default). The Reasoning logits ranged from -1.85 to 1.72 and the standard deviation was 0.6.

The item difficulties had a mean of -0.74 and a standard deviation of 0.44. The difficulty logits for each respective item are provided in Table 6 (the full set of item difficulties for each polytomous category is provided in Appendix A2).

Table 6. Item Difficulties
Item Abbreviation Item Difficulties
B1 -0.64
B2 -0.60
B3 -1.35
B4 -1.58
B5 -1.12
B6 -0.56
B7 -0.50
B8 -0.89
B9 -0.06
B10 -0.34
B11 -0.25
B12 -1.00
B13 -0.69
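The item difficulties in Table 6 appear to be the averages of the four category parameters listed for each item in Appendix A2. This is an inference from the reported numbers rather than something stated in the analysis output. The sketch below checks it for five of the items.

```python
# Category (step) parameters copied from Appendix A2.
STEPS = {
    "B1": [-1.73, -1.36, -0.77, 1.32],
    "B2": [-2.28, -0.83, -0.53, 1.23],
    "B3": [-2.47, -1.29, -1.96, 0.34],
    "B4": [-1.98, -2.01, -2.25, -0.06],
    "B9": [-1.73, 0.77, -0.50, 1.24],
}

# Item difficulties copied from Table 6.
REPORTED = {"B1": -0.64, "B2": -0.60, "B3": -1.35, "B4": -1.58, "B9": -0.06}

mean_step = {item: sum(values) / len(values) for item, values in STEPS.items()}
for item, value in mean_step.items():
    print(f"{item}: mean of steps = {value:.3f}, reported = {REPORTED[item]:.2f}")
```

Each mean agrees with the reported difficulty to within the rounding of the two-decimal inputs.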

The Cronbach’s alpha was 0.77 and the population-related MML estimated reliability was 0.78 (Wu, Tam, and Jen, 2016). In addition, the person separation reliability estimate was 0.79.

4.5 Wright map

The Wright Map is presented in Figure 2:

Figure 2. Wright Map for Reasoning Construct

4.6 Item expected score curves

The following graphs illustrate the expected score curves for all items.

4.7 Polytomous item category characteristic curves

The following graphs illustrate the category characteristic curves for all items. The full set of item parameters and standard errors for each item category can be found in Appendix A2.
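The quantities plotted in the expected score and category characteristic curves can be sketched directly from the partial credit model. The code assumes that the Appendix A2 `xsi` values are step parameters in the usual partial credit parameterisation (the probability of category k is proportional to the exponential of the cumulative sum of θ minus each step up to k). The function names are hypothetical.

```python
from math import exp


def pcm_probabilities(theta, steps):
    """Category probabilities (0..len(steps)) under the partial credit model."""
    exponents = [0.0]                       # category 0 has exponent 0
    for delta in steps:
        exponents.append(exponents[-1] + (theta - delta))
    top = max(exponents)                    # subtract the max for numerical stability
    weights = [exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta, steps):
    """Expected item score at ability theta."""
    return sum(k * p for k, p in enumerate(pcm_probabilities(theta, steps)))


B1_STEPS = [-1.73, -1.36, -0.77, 1.32]      # B1_Cat1..B1_Cat4 from Appendix A2

probs = pcm_probabilities(0.0, B1_STEPS)
score = expected_score(0.0, B1_STEPS)
print("P(category) at theta = 0:", [round(p, 3) for p in probs])
print("expected score at theta = 0:", round(score, 2))
```

Evaluating `pcm_probabilities` over a grid of θ values traces the category characteristic curves for an item, and `expected_score` over the same grid traces its expected score curve.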

4.8 Thurstonian thresholds

The Thurstonian thresholds are presented in Table 7.

Table 7. Thurstonian Thresholds
Item Cat1 Cat2 Cat3 Cat4
B1 -2.175751 -1.3294373 -0.4919128 1.4343567
B2 -2.476593 -1.0641174 -0.2469177 1.3794250
B3 -2.761505 -1.7571716 -1.2988586 0.4419250
B4 -2.637176 -2.0988464 -1.6495056 0.0482483
B5 -2.678925 -1.5724182 -0.9652405 0.7561340
B6 -2.381195 -1.0930481 -0.2672424 1.5083313
B7 -2.152313 -0.9399719 -0.3304138 1.4442444
B8 -2.506073 -1.3845520 -0.6927795 1.0302429
B9 -1.820526 -0.1070251 0.2503967 1.4391174
B10 -2.157806 -0.6253967 -0.1255188 1.5379944
B11 -1.647125 -0.4615173 0.0244446 1.0886536
B12 -2.181976 -1.3003235 -0.8584900 0.3316956
B13 -1.825836 -0.8394470 -0.5472107 0.4660950
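A Thurstonian threshold for category k is the ability at which the probability of scoring k or higher equals 0.5. The sketch below recovers the B1 thresholds from the Appendix A2 step parameters by bisection. It assumes the usual partial credit parameterisation, and the function names are hypothetical. Because the step parameters are rounded to two decimals, the results match the table only approximately.

```python
from math import exp


def pcm_probabilities(theta, steps):
    """Category probabilities (0..len(steps)) under the partial credit model."""
    exponents = [0.0]
    for delta in steps:
        exponents.append(exponents[-1] + (theta - delta))
    top = max(exponents)
    weights = [exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def thurstonian_threshold(steps, k, lo=-10.0, hi=10.0, tol=1e-8):
    """Ability at which P(X >= k) = 0.5, found by bisection."""
    # P(X >= k | theta) increases with theta, so the crossing point is unique.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(pcm_probabilities(mid, steps)[k:]) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


B1_STEPS = [-1.73, -1.36, -0.77, 1.32]      # B1_Cat1..B1_Cat4 from Appendix A2
thresholds = [thurstonian_threshold(B1_STEPS, k) for k in range(1, 5)]
print([round(t, 2) for t in thresholds])
```

Unlike the step parameters, which can be disordered (see B3 in Appendix A2), Thurstonian thresholds are always ordered from the lowest to the highest category.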

5. References

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443-459.

Brennan, R. L. (2001). Generalizability Theory. New York: Springer-Verlag.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334. doi:10.1007/BF02310555

Laukaityte, I., & Wiberg, M. (2017). Using plausible values in secondary analysis in large-scale assessments. Communications in Statistics - Theory and Methods, 46(22), 11341-11357. doi:10.1080/03610926.2016.1267764

Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47(3), 337-347.

Revelle, W. (2018). psych: Procedures for Personality and Psychological Research, Version = 1.8.4. Northwestern University, Evanston, Illinois, USA. Retrieved from https://CRAN.R-project.org/package=psych

Robitzsch, A., Kiefer, T., & Wu, M. (2018). TAM: Test analysis modules. R package version 2.10-24. Retrieved from https://CRAN.R-project.org/package=TAM

Torres Irribarra, D. & Freund, R. (2014). Wright Map: IRT item-person map, Version = 1.2.1. Retrieved from http://github.com/david-ti/wrightmap

Willse, J. T. (2018). CTT: Classical Test Theory Functions. R package version 2.3.2. Retrieved from https://CRAN.R-project.org/package=CTT

Wu, M. (2005). The role of plausible values in large-scale surveys. Studies in Educational Evaluation, 31(2-3), 114-128. doi:10.1016/j.stueduc.2005.05.005

Wu, M. & Adams, R. J. (2013). Properties of Rasch residual fit statistics. Journal of Applied Measurement, 14(4), 339-355.

Wu, M., Tam, H. P., & Jen, T-H. (2016). Educational Measurement for Applied Researchers: Theory into Practice. Singapore: Springer Nature.

6. Appendices

A1. Pearson Correlation Matrix

Using the psych package, the Pearson correlation matrix is as follows:

Pearson Correlation Matrix
B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 Row_Totals
B1 1.00 0.33 0.27 0.29 0.24 0.29 0.18 0.20 0.22 0.19 0.18 0.28 0.18 0.56
B2 0.33 1.00 0.24 0.33 0.38 0.28 0.10 0.15 0.25 0.21 0.14 0.32 0.21 0.57
B3 0.27 0.24 1.00 0.38 0.27 0.23 0.29 0.30 0.17 0.17 0.10 0.19 0.14 0.52
B4 0.29 0.33 0.38 1.00 0.41 0.19 0.27 0.27 0.18 0.34 0.16 0.32 0.24 0.60
B5 0.24 0.38 0.27 0.41 1.00 0.20 0.18 0.19 0.20 0.39 0.16 0.32 0.24 0.59
B6 0.29 0.28 0.23 0.19 0.20 1.00 0.03 0.14 0.20 0.12 0.05 0.18 0.13 0.44
B7 0.18 0.10 0.29 0.27 0.18 0.03 1.00 0.34 0.09 0.19 0.10 0.15 0.12 0.44
B8 0.20 0.15 0.30 0.27 0.19 0.14 0.34 1.00 0.07 0.17 0.10 0.17 0.14 0.45
B9 0.22 0.25 0.17 0.18 0.20 0.20 0.09 0.07 1.00 0.18 0.39 0.22 0.14 0.52
B10 0.19 0.21 0.17 0.34 0.39 0.12 0.19 0.17 0.18 1.00 0.20 0.30 0.26 0.56
B11 0.18 0.14 0.10 0.16 0.16 0.05 0.10 0.10 0.39 0.20 1.00 0.15 0.04 0.45
B12 0.28 0.32 0.19 0.32 0.32 0.18 0.15 0.17 0.22 0.30 0.15 1.00 0.47 0.60
B13 0.18 0.21 0.14 0.24 0.24 0.13 0.12 0.14 0.14 0.26 0.04 0.47 1.00 0.50
Row_Totals 0.56 0.57 0.52 0.60 0.59 0.44 0.44 0.45 0.52 0.56 0.45 0.60 0.50 1.00

Note. The bottom row and the final column of the correlation matrix contain the Pearson correlation coefficients between each respective item and the total score.

A2. The Item Category Parameters and Standard Errors

Item Category Parameters and Standard Errors
Parameter xsi se.xsi
B1_Cat1 -1.73 0.11
B1_Cat2 -1.36 0.06
B1_Cat3 -0.77 0.04
B1_Cat4 1.32 0.05
B2_Cat1 -2.28 0.11
B2_Cat2 -0.83 0.05
B2_Cat3 -0.53 0.04
B2_Cat4 1.23 0.05
B3_Cat1 -2.47 0.22
B3_Cat2 -1.29 0.09
B3_Cat3 -1.96 0.05
B3_Cat4 0.34 0.04
B4_Cat1 -1.98 0.27
B4_Cat2 -2.01 0.13
B4_Cat3 -2.25 0.07
B4_Cat4 -0.06 0.04
B5_Cat1 -2.41 0.17
B5_Cat2 -1.26 0.07
B5_Cat3 -1.44 0.04
B5_Cat4 0.64 0.04
B6_Cat1 -2.15 0.11
B6_Cat2 -0.90 0.05
B6_Cat3 -0.57 0.04
B6_Cat4 1.38 0.05
B7_Cat1 -1.92 0.09
B7_Cat2 -0.58 0.05
B7_Cat3 -0.82 0.04
B7_Cat4 1.33 0.05
B8_Cat1 -2.23 0.14
B8_Cat2 -1.15 0.06
B8_Cat3 -1.08 0.04
B8_Cat4 0.90 0.04
B9_Cat1 -1.73 0.06
B9_Cat2 0.77 0.04
B9_Cat3 -0.50 0.04
B9_Cat4 1.24 0.06
B10_Cat1 -2.02 0.08
B10_Cat2 -0.02 0.04
B10_Cat3 -0.76 0.04
B10_Cat4 1.42 0.05
B11_Cat1 -1.42 0.06
B11_Cat2 0.00 0.04
B11_Cat3 -0.39 0.04
B11_Cat4 0.82 0.05
B12_Cat1 -1.83 0.12
B12_Cat2 -0.91 0.06
B12_Cat3 -1.37 0.05
B12_Cat4 0.12 0.04
B13_Cat1 -1.57 0.08
B13_Cat2 -0.07 0.05
B13_Cat3 -1.32 0.04
B13_Cat4 0.22 0.04