Test equivalence of psychometric properties

Abstract

Background: Psychometric equivalence is essential in test construction to ensure that sampled items perform consistently across different test versions. This analysis evaluates the psychometric properties of 30 dichotomous (0/1) items from a sample of 1,000 participants by clustering items based on their difficulty and discrimination using Item Response Theory (IRT) models.

Methods: The analysis applied Categorical Principal Component Analysis (Princals) to assess unidimensionality, Exploratory Factor Analysis (EFA) to investigate factor structures, and Item Factor Analysis (IFA) to confirm model simplicity. IRT models, including the Rasch model, Two-Parameter Logistic (2PL), and Three-Parameter Logistic (3PL) models, were used to estimate item parameters and guide the clustering process. The 2PL model was selected because it accommodates varying item discrimination, which the Rasch model does not, without adding the guessing parameter of the 3PL model. Using K-Means clustering, items were categorized into three groups based on their discrimination and difficulty levels.

Results: The Princals analysis indicated that the items aligned along a single dimension, supporting unidimensionality. The EFA, together with a scree plot, also supported a one-factor solution. The IFA suggested that a simpler factor structure provided a better fit. The Rasch model captured a wide range of item difficulties but was limited by its assumption of equal discrimination across items. The 2PL model provided the best fit, with discrimination parameters ranging from 0.778 to 1.862 and difficulty parameters covering a wide range. The guessing parameter in the 3PL model did not significantly improve the model fit. The Elbow and Silhouette methods indicated that three clusters were optimal for K-Means clustering, facilitating balanced item sampling.

Conclusions: The 2PL model was the optimal framework for clustering the 30 items based on difficulty and discrimination. This approach enabled accurate grouping into three clusters, allowing future test forms to draw items from each level to ensure psychometric equivalence and improve the reliability and fairness of assessments while accurately measuring diverse participant abilities.

Descriptive statistics

The dataset includes responses to 30 dichotomous (0/1) items (U1–U30) from 1,000 participants. Each row represents a participant’s response pattern, with “1” indicating a correct response and “0” indicating an incorrect response. Demographic data such as group and age are also given. The data can be used for psychometric analysis, particularly with Item Response Theory (IRT) models, to assess item difficulty and discrimination.

Sample of Dichotomous Item Responses
U1 U2 U3 U4 U5 U6 U7 U8 U9 U10 U11 U12 U13 U14 U15 U16 U17 U18 U19 U20 U21 U22 U23 U24 U25 U26 U27 U28 U29 U30 group Age
1 0 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 0 men 28
1 1 0 1 1 1 0 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 0 1 0 women 26
1 0 0 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 women 28
1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 1 1 women 30
1 1 0 1 0 1 0 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 men 26
0 0 0 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 0 men 26

Age and Group Statistics

Summary Statistics of Age
Min. 1st Qu. Median Mean 3rd Qu. Max.
20 26 28 27.727 30 32
Distribution of Group
Group Count
men 500
women 500

The dataset consists of 1,000 participants aged 20 to 32 years. The mean age is 27.73 years and the median is 28 years. The distribution is relatively narrow, with most participants (63.3%) between 26 and 30 years, so the sample is concentrated in the late twenties. The group distribution is balanced, with exactly 500 women and 500 men.

Counting the 0’s and 1’s per Item

Counts of 0s and 1s per Item
Count_1s Count_0s
U1 501 499
U2 193 807
U3 470 530
U4 642 358
U5 293 707
U6 844 156
U7 315 685
U8 199 801
U9 635 365
U10 802 198
U11 782 218
U12 173 827
U13 757 243
U14 374 626
U15 654 346
U16 473 527
U17 674 326
U18 818 182
U19 520 480
U20 681 319
U21 833 167
U22 472 528
U23 298 702
U24 322 678
U25 181 819
U26 328 672
U27 535 465
U28 226 774
U29 698 302
U30 206 794

The distribution of responses across items U1 to U30 shows variability in the number of 0’s and 1’s. Items like U6 and U21 had a high proportion of 1’s (844 and 833), indicating they were easier for participants. In contrast, U12 and U25 had significantly fewer 1’s (173 and 181), suggesting higher difficulty.

This pattern shows that some items were easier, yielding higher correct-response rates, while others were harder, yielding more incorrect answers. Overall, the response distributions indicate varying levels of item difficulty within the dataset.
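The counts translate directly into classical difficulty indices (proportion correct). The following is a minimal sketch in Python, using a handful of counts copied from the table above; the function and variable names are illustrative and are not part of the original analysis.

```python
# Counts of correct responses (1s) copied from the table above; N = 1,000.
N_PARTICIPANTS = 1000
count_ones = {"U6": 844, "U21": 833, "U1": 501, "U25": 181, "U12": 173}


def proportion_correct(n_correct: int, n_total: int = N_PARTICIPANTS) -> float:
    """Classical item difficulty index: share of participants answering correctly."""
    return n_correct / n_total


p_values = {item: proportion_correct(n) for item, n in count_ones.items()}

# List the selected items from easiest (highest proportion correct) to hardest.
for item, p in sorted(p_values.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item}: p = {p:.3f}")
```

A higher proportion correct means an easier item, so this index runs in the opposite direction to the IRT difficulty parameters reported later.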


Assessing Dimensionality

Before fitting an Item Response Theory (IRT) model, it is crucial to assess the scale’s dimensionality. The following methods were used to analyze the latent item structure: Categorical Principal Component Analysis (Princals), Exploratory Factor Analysis (EFA), and Item Factor Analysis (IFA). These techniques check whether the scale is appropriate for IRT modeling.

Categorical Principal Component Analysis (Princals)

The Princals solution shows that all 30 items point in the same direction on the principal component plot, suggesting they measure a similar concept. Because the items appear to measure a single construct, they can reasonably be compared and clustered on a common scale of difficulty and discrimination.

Exploratory Factor Analysis (EFA)

EFA Loadings Table
Factor 1 Factor 2 Factor 3
U1 0.4583751 0.4146866 0.2245282
U2 0.4749530 0.5012407 0.0883197
U3 0.4540790 0.3512264 0.0874065
U4 0.5368042 0.2194293 0.1355101
U5 0.4991854 0.3067454 0.1303310
U6 0.1765252 0.1424032 0.9713710
U7 0.3560481 0.3641058 0.0744484
U8 0.2052611 0.4936294 0.2133258
U9 0.4427127 0.2513244 0.2079537
U10 0.4360856 0.3907758 0.1537579
U11 0.5655336 0.1685652 0.1369274
U12 0.1731500 0.6412458 0.2160460
U13 0.4997258 0.3650578 0.0974844
U14 0.4238033 0.4193519 0.1390296
U15 0.6852040 0.2895528 0.1409246
U16 0.5054943 0.4780923 0.1753344
U17 0.4192744 0.4352794 0.0120122
U18 0.2765364 0.3262428 0.1715914
U19 0.4144009 0.2099004 -0.0399616
U20 0.6080735 0.2784103 0.0691154
U21 0.4073212 0.2540542 0.1202816
U22 0.5184947 0.3358727 0.2031776
U23 0.4023313 0.5120767 0.0276091
U24 0.4453539 0.3604365 0.2536687
U25 0.5685594 0.3351204 0.1343321
U26 0.5255998 0.3137475 0.1967931
U27 0.3768026 0.3910923 0.0724512
U28 0.5413010 0.3418858 0.2666009
U29 0.4252332 0.3504914 0.0695117
U30 0.3364113 0.6218859 -0.0226244

The eigenvalues decline sharply after the first factor, dropping from 10.62 (Factor 1) to 0.96 (Factor 2). Factor 1 therefore accounts for far more variance than any later factor, and the scree plot shows a clear “elbow” immediately after it.

The EFA factor loadings are broadly consistent with the eigenvalue pattern. Factor 1 loads at 0.40 or higher on 23 of the 30 items (e.g., U15 = 0.685, U20 = 0.608). Factor 2 loadings are less consistent and exceed 0.60 for only two items (U12 = 0.641, U30 = 0.622). Factor 3 has a single large loading, on U6 (0.971), which is not enough to define a meaningful factor.

Taken together, the eigenvalue drop from 10.62 to 0.96 and the weaker loadings on Factors 2 and 3 suggest that a single-factor model fits best. The data do not clearly split into three distinct factors, so the more parsimonious one-factor model is preferred.
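The eigenvalue argument can be checked with a few lines of arithmetic. The sketch below uses only the two eigenvalues reported in the text and assumes they come from a 30 x 30 correlation matrix, so that the eigenvalues sum to 30. That assumption is not stated in the original, and the variable names are illustrative.

```python
# Eigenvalues reported in the text for the first two factors.
eigenvalues = [10.62, 0.96]

# Assumption: the eigenvalues come from a 30 x 30 correlation matrix,
# so the total variance equals the number of items.
n_items = 30

# Share of total variance attributable to each factor.
variance_share = [ev / n_items for ev in eigenvalues]

# Ratio of the first to the second eigenvalue; a large ratio is commonly
# read as evidence of one dominant factor.
first_to_second = eigenvalues[0] / eigenvalues[1]

# Kaiser criterion: retain factors whose eigenvalue exceeds 1.
retained = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

print(f"variance share: {[round(v, 3) for v in variance_share]}")
print(f"first/second ratio: {first_to_second:.2f}")
print(f"factors retained (Kaiser): {retained}")
```

Under that assumption the first factor carries roughly a third of the total variance and about eleven times as much as the second, which is the pattern the scree plot makes visible.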

Item Factor Analysis (IFA)

ANOVA Results for Item Factor Analysis
AIC SABIC HQ BIC logLik X2 df p
fitifa1 30805.21 30909.11 30917.12 31099.67 -15342.60 NA NA NA
fitifa2 30811.02 30965.14 30977.03 31247.81 -15316.51 52.185 29 0.005

Two models were compared. The simpler model (fitifa1) had lower AIC, SABIC, HQ, and BIC values, indicating a better balance between fit and complexity. The likelihood-ratio test was significant (χ² = 52.185, df = 29, p = 0.005), so the more complex model (fitifa2) does raise the log-likelihood, but not by enough to offset the penalty for its 29 additional parameters.

On balance, the IFA results support the simpler model (fitifa1), which captures the factor structure without unnecessary complexity. Because every information criterion favors it, the factor structure can be treated as relatively simple, and the items can be described with fewer factors.
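The comparison in the table rests on two standard quantities: AIC = 2k - 2 logLik, where k is the number of estimated parameters, and the likelihood-ratio statistic, which is twice the difference in log-likelihood. The sketch below recomputes both from the tabled values. The helper name is illustrative, and the parameter counts are inferred from the AIC formula rather than reported in the original.

```python
# Values copied from the model-comparison table above.
models = {
    "fitifa1": {"aic": 30805.21, "loglik": -15342.60},
    "fitifa2": {"aic": 30811.02, "loglik": -15316.51},
}


def implied_n_params(aic: float, loglik: float) -> int:
    """Invert AIC = 2k - 2*logLik to recover the parameter count k."""
    return round((aic + 2 * loglik) / 2)


k1 = implied_n_params(**models["fitifa1"])
k2 = implied_n_params(**models["fitifa2"])

# Likelihood-ratio statistic: twice the gain in log-likelihood.
# The result differs slightly from the tabled 52.185 because the
# log-likelihoods in the table are rounded to two decimals.
lr_stat = 2 * (models["fitifa2"]["loglik"] - models["fitifa1"]["loglik"])
df = k2 - k1

# Positive difference means the simpler model has the lower (better) AIC.
aic_difference = models["fitifa2"]["aic"] - models["fitifa1"]["aic"]

print(f"k1 = {k1}, k2 = {k2}, df = {df}")
print(f"LR statistic = {lr_stat:.2f}")
print(f"AIC(fitifa2) - AIC(fitifa1) = {aic_difference:.2f}")
```

The recovered difference in parameter counts matches the 29 degrees of freedom in the table, which is a useful consistency check on the reported values.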


The Rasch Model

Item Difficulty Parameters with 95% CI
Item Estimate Std. Error Lower CI Upper CI
U2 U2 1.8276169 0.0885824 1.6539954 2.0012384
U3 U3 0.1303256 0.0717695 -0.0103426 0.2709937
U4 U4 -0.7862235 0.0738644 -0.9309977 -0.6414493
U5 U5 1.1265567 0.0781315 0.9734190 1.2796943
U6 U6 -2.1491982 0.0935869 -2.3326285 -1.9657680
U7 U7 0.9914796 0.0767251 0.8410984 1.1418609
U8 U8 1.7798918 0.0876863 1.6080266 1.9517570
U9 U9 -0.7472956 0.0736144 -0.8915798 -0.6030114
U10 U10 -1.8031336 0.0862896 -1.9722613 -1.6340060
U11 U11 -1.6554300 0.0837052 -1.8194922 -1.4913679
U12 U12 1.9943255 0.0919516 1.8141004 2.1745505
U13 U13 -1.4821102 0.0810397 -1.6409480 -1.3232723
U14 U14 0.6493271 0.0739466 0.5043917 0.7942624
U15 U15 -0.8535999 0.0743329 -0.9992924 -0.7079074
U16 U16 0.1145091 0.0717409 -0.0261031 0.2551213
U17 U17 -0.9679428 0.0752333 -1.1154001 -0.8204854
U18 U18 -1.9284397 0.0887228 -2.1023363 -1.7545431
U19 U19 -0.1322831 0.0715829 -0.2725856 0.0080193
U20 U20 -1.0086460 0.0755867 -1.1567960 -0.8604961
U21 U21 -2.0528812 0.0913706 -2.2319675 -1.8737948
U22 U22 0.1197798 0.0717501 -0.0208505 0.2604101
U23 U23 1.0954197 0.0777908 0.9429497 1.2478898
U24 U24 0.9494793 0.0763245 0.7998832 1.0990753
U25 U25 1.9261390 0.0905277 1.7487046 2.1035734
U26 U26 0.9138198 0.0759976 0.7648646 1.0627751
U27 U27 -0.2110157 0.0716476 -0.3514449 -0.0705865
U28 U28 1.5759768 0.0841794 1.4109852 1.7409684
U29 U29 -1.1091829 0.0765347 -1.2591910 -0.9591747
U30 U30 1.7253959 0.0866988 1.5554661 1.8953256
Item Easiness Parameters with 95% CI
Item Estimate Std. Error Lower CI Upper CI
beta U1 beta U1 0.0326600 0.0715813 -0.1076394 0.1729594
beta U2 beta U2 -1.8276169 0.0885824 -2.0012384 -1.6539954
beta U3 beta U3 -0.1303256 0.0717695 -0.2709937 0.0103426
beta U4 beta U4 0.7862235 0.0738644 0.6414493 0.9309977
beta U5 beta U5 -1.1265567 0.0781315 -1.2796943 -0.9734190
beta U6 beta U6 2.1491982 0.0935869 1.9657680 2.3326285
beta U7 beta U7 -0.9914796 0.0767251 -1.1418609 -0.8410984
beta U8 beta U8 -1.7798918 0.0876863 -1.9517570 -1.6080266
beta U9 beta U9 0.7472956 0.0736144 0.6030114 0.8915798
beta U10 beta U10 1.8031336 0.0862896 1.6340060 1.9722613
beta U11 beta U11 1.6554300 0.0837052 1.4913679 1.8194922
beta U12 beta U12 -1.9943255 0.0919516 -2.1745505 -1.8141004
beta U13 beta U13 1.4821102 0.0810397 1.3232723 1.6409480
beta U14 beta U14 -0.6493271 0.0739466 -0.7942624 -0.5043917
beta U15 beta U15 0.8535999 0.0743329 0.7079074 0.9992924
beta U16 beta U16 -0.1145091 0.0717409 -0.2551213 0.0261031
beta U17 beta U17 0.9679428 0.0752333 0.8204854 1.1154001
beta U18 beta U18 1.9284397 0.0887228 1.7545431 2.1023363
beta U19 beta U19 0.1322831 0.0715829 -0.0080193 0.2725856
beta U20 beta U20 1.0086460 0.0755867 0.8604961 1.1567960
beta U21 beta U21 2.0528812 0.0913706 1.8737948 2.2319675
beta U22 beta U22 -0.1197798 0.0717501 -0.2604101 0.0208505
beta U23 beta U23 -1.0954197 0.0777908 -1.2478898 -0.9429497
beta U24 beta U24 -0.9494793 0.0763245 -1.0990753 -0.7998832
beta U25 beta U25 -1.9261390 0.0905277 -2.1035734 -1.7487046
beta U26 beta U26 -0.9138198 0.0759976 -1.0627751 -0.7648646
beta U27 beta U27 0.2110157 0.0716476 0.0705865 0.3514449
beta U28 beta U28 -1.5759768 0.0841794 -1.7409684 -1.4109852
beta U29 beta U29 1.1091829 0.0765347 0.9591747 1.2591910
beta U30 beta U30 -1.7253959 0.0866988 -1.8953256 -1.5554661

The Rasch model assumes equal discrimination for all items, with discrimination set to 1. Item difficulty (eta) ranges from -2.149 (U6) to 1.994 (U12), showing a wide range of difficulty levels.

While the Rasch model is good for estimating item difficulty, its assumption of equal discrimination makes it less suitable for this dataset. The variation in difficulty suggests some items are easier (U6) and others harder (U12). However, by assuming equal discrimination, the model doesn’t account for how well items distinguish between respondents with different abilities. This means the Rasch model may oversimplify the data, so it might not be the best choice for clustering based on difficulty and discrimination.
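To make the role of the difficulty parameter concrete, the sketch below evaluates the Rasch response function, assuming the usual logistic form P(correct) = 1 / (1 + exp(-(theta - b))), for the easiest and hardest items. The function name is illustrative, and theta = 0 is chosen only as a reference point on the ability scale.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """P(correct) under the Rasch model: logistic of (ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# Difficulty estimates copied from the table above
# (the easiness parameters are their negatives).
difficulty = {"U6": -2.1491982, "U12": 1.9943255}

probabilities = {
    item: rasch_probability(theta=0.0, difficulty=b)
    for item, b in difficulty.items()
}

for item, p in probabilities.items():
    print(f"{item}: P(correct | theta = 0) = {p:.3f}")
```

Because every item shares the same slope, the curves for different items never cross; they differ only by a horizontal shift equal to the difference in difficulty.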


Two-Parameter Logistic Model

Summary of 2PL Model
Item Value Standard_error Z_value
Dffclt.U1 Dffclt.U1 -0.0008481 0.0610828 -0.013884
Dffclt.U2 Dffclt.U2 1.2687406 0.0901622 14.071760
Dffclt.U3 Dffclt.U3 0.1361012 0.0699564 1.945515
Dffclt.U4 Dffclt.U4 -0.6242749 0.0798422 -7.818856
Dffclt.U5 Dffclt.U5 0.9297283 0.0892332 10.419083
Dffclt.U6 Dffclt.U6 -2.1987512 0.2488342 -8.836208
Dffclt.U7 Dffclt.U7 0.9480237 0.1049707 9.031317
Dffclt.U8 Dffclt.U8 1.6488531 0.1552095 10.623404
Dffclt.U9 Dffclt.U9 -0.6270314 0.0849658 -7.379806
Dffclt.U10 Dffclt.U10 -1.3526105 0.1096486 -12.335868
Dffclt.U11 Dffclt.U11 -1.3128331 0.1134219 -11.574781
Dffclt.U12 Dffclt.U12 1.6436196 0.1378746 11.921116
Dffclt.U13 Dffclt.U13 -1.0858853 0.0911028 -11.919336
Dffclt.U14 Dffclt.U14 0.5363938 0.0728710 7.360864
Dffclt.U15 Dffclt.U15 -0.5365018 0.0598813 -8.959423
Dffclt.U16 Dffclt.U16 0.0976415 0.0562389 1.736190
Dffclt.U17 Dffclt.U17 -0.7645914 0.0839846 -9.103949
Dffclt.U18 Dffclt.U18 -1.9453258 0.2115523 -9.195482
Dffclt.U19 Dffclt.U19 -0.1104972 0.0930060 -1.188065
Dffclt.U20 Dffclt.U20 -0.7144043 0.0727424 -9.821011
Dffclt.U21 Dffclt.U21 -1.8194657 0.1740250 -10.455195
Dffclt.U22 Dffclt.U22 0.1128470 0.0621808 1.814822
Dffclt.U23 Dffclt.U23 0.8628041 0.0817048 10.560012
Dffclt.U24 Dffclt.U24 0.7597336 0.0786717 9.657008
Dffclt.U25 Dffclt.U25 1.3611040 0.0974469 13.967653
Dffclt.U26 Dffclt.U26 0.7116288 0.0745092 9.550886
Dffclt.U27 Dffclt.U27 -0.1587176 0.0749001 -2.119059
Dffclt.U28 Dffclt.U28 1.1332858 0.0862080 13.145950
Dffclt.U29 Dffclt.U29 -0.9326043 0.0976556 -9.549933
Dffclt.U30 Dffclt.U30 1.3144806 0.1029637 12.766445
Dscrmn.U1 Dscrmn.U1 1.4431905 0.1178069 12.250471
Dscrmn.U2 Dscrmn.U2 1.6390342 0.1486804 11.023876
Dscrmn.U3 Dscrmn.U3 1.1620367 0.1017599 11.419396
Dscrmn.U4 Dscrmn.U4 1.1862358 0.1064482 11.143784
Dscrmn.U5 Dscrmn.U5 1.2303873 0.1110062 11.083948
Dscrmn.U6 Dscrmn.U6 0.8738399 0.1135162 7.697929
Dscrmn.U7 Dscrmn.U7 0.9910584 0.0969653 10.220753
Dscrmn.U8 Dscrmn.U8 1.0154358 0.1087823 9.334565
Dscrmn.U9 Dscrmn.U9 1.0850009 0.1004593 10.800399
Dscrmn.U10 Dscrmn.U10 1.3647625 0.1334303 10.228283
Dscrmn.U11 Dscrmn.U11 1.2405463 0.1216962 10.193796
Dscrmn.U12 Dscrmn.U12 1.2056943 0.1234510 9.766585
Dscrmn.U13 Dscrmn.U13 1.4105109 0.1295309 10.889380
Dscrmn.U14 Dscrmn.U14 1.2741796 0.1092822 11.659530
Dscrmn.U15 Dscrmn.U15 1.8623600 0.1526118 12.203254
Dscrmn.U16 Dscrmn.U16 1.7056954 0.1344097 12.690273
Dscrmn.U17 Dscrmn.U17 1.2135765 0.1099691 11.035613
Dscrmn.U18 Dscrmn.U18 0.8848572 0.1085157 8.154189
Dscrmn.U19 Dscrmn.U19 0.7781632 0.0839225 9.272405
Dscrmn.U20 Dscrmn.U20 1.4692961 0.1254177 11.715217
Dscrmn.U21 Dscrmn.U21 1.0589969 0.1201081 8.817030
Dscrmn.U22 Dscrmn.U22 1.4077961 0.1155528 12.183141
Dscrmn.U23 Dscrmn.U23 1.3319425 0.1160957 11.472801
Dscrmn.U24 Dscrmn.U24 1.3076825 0.1132013 11.551837
Dscrmn.U25 Dscrmn.U25 1.5777773 0.1463352 10.781942
Dscrmn.U26 Dscrmn.U26 1.3736622 0.1164977 11.791320
Dscrmn.U27 Dscrmn.U27 1.0551151 0.0964734 10.936852
Dscrmn.U28 Dscrmn.U28 1.5408897 0.1365496 11.284472
Dscrmn.U29 Dscrmn.U29 1.1100887 0.1059345 10.479010
Dscrmn.U30 Dscrmn.U30 1.3802906 0.1288097 10.715736

The 2PL model estimates both item difficulty and discrimination, making it more flexible than the Rasch model. Discrimination ranges from 0.778 (U19) to 1.862 (U15), and difficulty varies widely across items. Its AIC (30806.27) is lower than that of both the Rasch and 3PL models, indicating the best relative fit of the three.

The 2PL model is the best fit for this data because it allows each item to have its own discrimination value, unlike the Rasch model. High-discrimination items (e.g., U15 and U16) better differentiate between participants of different abilities, while low-discrimination items (e.g., U19) separate them less sharply and are therefore less informative. The wide range of difficulty also shows that items cover a broad ability spectrum, making the 2PL model well suited to grouping items based on difficulty and discrimination.
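The effect of discrimination can be illustrated with the 2PL response function and its item information. The sketch below assumes the standard form P = 1 / (1 + exp(-a(theta - b))) and that the tabled estimates use this parameterization; the function names are illustrative.

```python
import math


def p_2pl(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response: a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


# (discrimination, difficulty) estimates copied from the 2PL table above.
items = {"U15": (1.8623600, -0.5365018), "U19": (0.7781632, -0.1104972)}

# Information peaks at theta = b, where P = 0.5, giving a^2 / 4.
peak_information = {
    name: item_information(b, a, b) for name, (a, b) in items.items()
}

for name, info in peak_information.items():
    print(f"{name}: peak information = {info:.3f}")
```

Since peak information grows with the square of the discrimination, the most discriminating item contributes several times more measurement precision near its difficulty than the least discriminating one.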


Three-Parameter Logistic Model

Summary of 3PL Model
Item Value Standard_error Z_value
Gussng.U1 0.0000009 0.0003609 0.0025857
Gussng.U2 0.0000002 0.0000582 0.0031701
Gussng.U3 0.0357052 0.0632781 0.5642578
Gussng.U4 0.0001960 NaN NaN
Gussng.U5 0.0000000 0.0000194 0.0010626
Gussng.U6 0.0015934 0.0801375 0.0198834
Gussng.U7 0.0000431 0.0033458 0.0128825
Gussng.U8 0.0458528 0.0274881 1.6680980
Gussng.U9 0.0000029 0.0008204 0.0034785
Gussng.U10 0.2851522 0.1370039 2.0813445
Gussng.U11 0.0000032 0.0009037 0.0034951
Gussng.U12 0.0211417 0.0205739 1.0275975
Gussng.U13 0.1855913 0.1451978 1.2781969
Gussng.U14 0.0459580 0.0406205 1.1313990
Gussng.U15 0.0000048 0.0009051 0.0052610
Gussng.U16 0.0647455 0.0356979 1.8137065
Gussng.U17 0.2326992 0.0933739 2.4921230
Gussng.U18 0.3740668 0.2431689 1.5383005
Gussng.U19 0.1901005 0.1051988 1.8070599
Gussng.U20 0.0000008 0.0004119 0.0019622
Gussng.U21 0.0000156 0.0025430 0.0061353
Gussng.U22 0.0512474 0.0511980 1.0009661
Gussng.U23 0.0383912 0.0307128 1.2500078
Gussng.U24 0.0150560 0.0325145 0.4630556
Gussng.U25 0.0000001 0.0000436 0.0024084
Gussng.U26 0.0000007 0.0002423 0.0028031
Gussng.U27 0.2032610 0.0576739 3.5243161
Gussng.U28 0.0000071 0.0012296 0.0057429
Gussng.U29 0.1427725 0.1731639 0.8244932
Gussng.U30 0.0089231 0.0192527 0.4634710
Dffclt.U1 0.0119097 0.0601326 0.1980570
Dffclt.U2 1.2482655 0.0879533 14.1923610
Dffclt.U3 0.2214603 0.1507873 1.4686931
Dffclt.U4 -0.6136286 0.0700249 -8.7630005
Dffclt.U5 0.9197074 0.0868735 10.5867402
Dffclt.U6 -2.1935655 0.2917418 -7.5188580
Dffclt.U7 0.9366097 0.1021023 9.1732467
Dffclt.U8 1.5919940 0.1340811 11.8733703
Dffclt.U9 -0.6174472 0.0851973 -7.2472659
Dffclt.U10 -0.8258273 0.2935379 -2.8133584
Dffclt.U11 -1.3117914 0.1142150 -11.4852777
Dffclt.U12 1.5794644 0.1259695 12.5384683
Dffclt.U13 -0.7455169 0.2877414 -2.5909267
Dffclt.U14 0.6248992 0.1025810 6.0917611
Dffclt.U15 -0.5208393 0.0599754 -8.6842191
Dffclt.U16 0.2244899 0.0804311 2.7910815
Dffclt.U17 -0.2573540 0.2162693 -1.1899700
Dffclt.U18 -1.0038987 0.7850804 -1.2787208
Dffclt.U19 0.4474745 0.3085590 1.4502071
Dffclt.U20 -0.7028142 0.0730983 -9.6146460
Dffclt.U21 -1.8180335 0.1722984 -10.5516581
Dffclt.U22 0.2227416 0.1139132 1.9553616
Dffclt.U23 0.9106879 0.0887507 10.2611955
Dffclt.U24 0.7816172 0.0929963 8.4048211
Dffclt.U25 1.3373744 0.0950551 14.0694640
Dffclt.U26 0.7114651 0.0728489 9.7663077
Dffclt.U27 0.3234919 0.1439763 2.2468406
Dffclt.U28 1.1216178 0.0843141 13.3028452
Dffclt.U29 -0.6201115 0.3983104 -1.5568551
Dffclt.U30 1.2926744 0.0986429 13.1045811
Dscrmn.U1 1.4603442 0.1195811 12.2121638
Dscrmn.U2 1.6885134 0.1548361 10.9051677
Dscrmn.U3 1.2613939 0.1898812 6.6430674
Dscrmn.U4 1.1859815 0.1043215 11.3685256
Dscrmn.U5 1.2624734 0.1149204 10.9856294
Dscrmn.U6 0.8747084 0.1133934 7.7139283
Dscrmn.U7 1.0159475 0.1002129 10.1378939
Dscrmn.U8 1.3804055 0.2852295 4.8396307
Dscrmn.U9 1.0855059 0.1005572 10.7949052
Dscrmn.U10 1.6887556 0.3036448 5.5616156
Dscrmn.U11 1.2328643 0.1200453 10.2699879
Dscrmn.U12 1.4563134 0.2664375 5.4658720
Dscrmn.U13 1.6456813 0.2951939 5.5749172
Dscrmn.U14 1.4758585 0.2152583 6.8562199
Dscrmn.U15 1.8573273 0.1526182 12.1697658
Dscrmn.U16 2.0487720 0.2551623 8.0292905
Dscrmn.U17 1.6200036 0.2905938 5.5748050
Dscrmn.U18 1.0774335 0.2907628 3.7055417
Dscrmn.U19 1.0642060 0.2716899 3.9169878
Dscrmn.U20 1.4663331 0.1249919 11.7314238
Dscrmn.U21 1.0584922 0.1178840 8.9791004
Dscrmn.U22 1.5947631 0.2260503 7.0549044
Dscrmn.U23 1.5901450 0.2452181 6.4846162
Dscrmn.U24 1.4031319 0.1966532 7.1350558
Dscrmn.U25 1.6256323 0.1524307 10.6647299
Dscrmn.U26 1.3975489 0.1195062 11.6943602
Dscrmn.U27 1.5916525 0.2682938 5.9324987
Dscrmn.U28 1.5741189 0.1411647 11.1509409
Dscrmn.U29 1.2466152 0.2505410 4.9756932
Dscrmn.U30 1.5016014 0.2239273 6.7057534

The 3PL model adds a guessing parameter (c), the probability that a respondent with very low ability answers correctly. The estimate is near zero for most items, although it is larger for a few (e.g., U10, U17, U18, U27). The model’s AIC (30843.97) is higher than that of the 2PL model (30806.27), indicating a worse balance of fit and complexity.

Because the guessing parameter contributes little overall, the 3PL model adds complexity without improving fit, and the higher AIC suggests it is over-parameterized. This suggests that guessing is not a major factor in how people answer, making the simpler 2PL model better suited to analyzing item properties and forming balanced item clusters.
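The guessing parameter acts as a lower asymptote on the response curve. The sketch below assumes the standard 3PL form P = c + (1 - c) / (1 + exp(-a(theta - b))) and uses the U27 estimates from the table; the function name and the choice of theta = -10 as a "very low ability" are illustrative.

```python
import math


def p_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability: lower asymptote c plus (1 - c) times the 2PL curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Estimates for U27 copied from the 3PL table above.
a, b, c = 1.5916525, 0.3234919, 0.2032610

# For a very low ability the probability approaches c rather than 0.
very_low_ability = -10.0
p_low = p_3pl(very_low_ability, a, b, c)
print(f"P(correct) at theta = {very_low_ability}: {p_low:.3f}")

# Setting c = 0 recovers the 2PL curve, which is why the 2PL is nested in the 3PL.
p_at_b_without_guessing = p_3pl(b, a, b, 0.0)
print(f"P(correct) at theta = b with c = 0: {p_at_b_without_guessing:.3f}")

# AIC values reported in the text for the two models.
aic_2pl, aic_3pl = 30806.27, 30843.97
print(f"AIC(3PL) - AIC(2PL) = {aic_3pl - aic_2pl:.2f}")
```

When c is close to zero for most items, the two curves are nearly identical, so the extra parameters are penalized by AIC without buying a better description of the data.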


Cluster Analysis

Item Difficulty and Discrimination Parameters
Item Difficulty Discrimination
U1 U1 -0.0008481 1.4431905
U2 U2 1.2687406 1.6390342
U3 U3 0.1361012 1.1620367
U4 U4 -0.6242749 1.1862358
U5 U5 0.9297283 1.2303873
U6 U6 -2.1987512 0.8738399
U7 U7 0.9480237 0.9910584
U8 U8 1.6488531 1.0154358
U9 U9 -0.6270314 1.0850009
U10 U10 -1.3526105 1.3647625
U11 U11 -1.3128331 1.2405463
U12 U12 1.6436196 1.2056943
U13 U13 -1.0858853 1.4105109
U14 U14 0.5363938 1.2741796
U15 U15 -0.5365018 1.8623600
U16 U16 0.0976415 1.7056954
U17 U17 -0.7645914 1.2135765
U18 U18 -1.9453258 0.8848572
U19 U19 -0.1104972 0.7781632
U20 U20 -0.7144043 1.4692961
U21 U21 -1.8194657 1.0589969
U22 U22 0.1128470 1.4077961
U23 U23 0.8628041 1.3319425
U24 U24 0.7597336 1.3076825
U25 U25 1.3611040 1.5777773
U26 U26 0.7116288 1.3736622
U27 U27 -0.1587176 1.0551151
U28 U28 1.1332858 1.5408897
U29 U29 -0.9326043 1.1100887
U30 U30 1.3144806 1.3802906

This analysis determines the best number of clusters for grouping test items based on their difficulty and discrimination using K-Means clustering. The Elbow Method indicates an inflection point at k = 3, meaning that adding more clusters does not significantly reduce within-cluster variance. The Silhouette Method also peaks at k = 3, confirming that this number offers the best balance between cluster cohesion and separation.

Therefore, three clusters provide the ideal solution for grouping items with similar characteristics, helping to balance difficulty and discrimination in item sampling.
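The Elbow and Silhouette procedures can be sketched in plain Python. The code below is an illustration only: it uses a basic Lloyd's algorithm with random restarts on the unscaled parameters (rounded to three decimals), whereas the original analysis may have used a different implementation, initialization, or scaling, so the values it prints need not match the figures behind the text. All function names are hypothetical.

```python
import math
import random

# (difficulty, discrimination) for U1..U30, from the table above, rounded to 3 decimals.
ITEMS = [
    (-0.001, 1.443), (1.269, 1.639), (0.136, 1.162), (-0.624, 1.186), (0.930, 1.230),
    (-2.199, 0.874), (0.948, 0.991), (1.649, 1.015), (-0.627, 1.085), (-1.353, 1.365),
    (-1.313, 1.241), (1.644, 1.206), (-1.086, 1.411), (0.536, 1.274), (-0.537, 1.862),
    (0.098, 1.706), (-0.765, 1.214), (-1.945, 0.885), (-0.110, 0.778), (-0.714, 1.469),
    (-1.819, 1.059), (0.113, 1.408), (0.863, 1.332), (0.760, 1.308), (1.361, 1.578),
    (0.712, 1.374), (-0.159, 1.055), (1.133, 1.541), (-0.933, 1.110), (1.314, 1.380),
]


def kmeans(points, k, n_init=20, max_iter=100, seed=0):
    """Basic Lloyd's algorithm with random restarts; returns (labels, wcss)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_init):
        centers = rng.sample(points, k)
        labels = None
        for _ in range(max_iter):
            new_labels = [
                min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
            ]
            if new_labels == labels:
                break  # assignments stopped changing
            labels = new_labels
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:  # keep the old center if a cluster empties
                    centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        wcss = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
        if best is None or wcss < best[1]:
            best = (labels, wcss)
    return best


def mean_silhouette(points, labels):
    """Average silhouette width; a point alone in its cluster scores 0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Elbow: look for the k after which WCSS stops dropping sharply.
# Silhouette: look for the k with the highest average width (undefined for k = 1).
for k in range(1, 7):
    labels, wcss = kmeans(ITEMS, k)
    sil = mean_silhouette(ITEMS, labels) if k > 1 else float("nan")
    print(f"k={k}: WCSS={wcss:.2f}, silhouette={sil:.3f}")
```

Whether to standardize difficulty and discrimination before clustering is a design choice: without scaling, the variable with the larger spread (here, difficulty) dominates the distance.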

Clustering & Conclusion

Clusters
Item Difficulty Discrimination Cluster
U6 U6 -2.1987512 0.8738399 1
U10 U10 -1.3526105 1.3647625 1
U11 U11 -1.3128331 1.2405463 1
U13 U13 -1.0858853 1.4105109 1
U18 U18 -1.9453258 0.8848572 1
U21 U21 -1.8194657 1.0589969 1
U29 U29 -0.9326043 1.1100887 1
U1 U1 -0.0008481 1.4431905 2
U3 U3 0.1361012 1.1620367 2
U4 U4 -0.6242749 1.1862358 2
U9 U9 -0.6270314 1.0850009 2
U15 U15 -0.5365018 1.8623600 2
U16 U16 0.0976415 1.7056954 2
U17 U17 -0.7645914 1.2135765 2
U19 U19 -0.1104972 0.7781632 2
U20 U20 -0.7144043 1.4692961 2
U22 U22 0.1128470 1.4077961 2
U27 U27 -0.1587176 1.0551151 2
U2 U2 1.2687406 1.6390342 3
U5 U5 0.9297283 1.2303873 3
U7 U7 0.9480237 0.9910584 3
U8 U8 1.6488531 1.0154358 3
U12 U12 1.6436196 1.2056943 3
U14 U14 0.5363938 1.2741796 3
U23 U23 0.8628041 1.3319425 3
U24 U24 0.7597336 1.3076825 3
U25 U25 1.3611040 1.5777773 3
U26 U26 0.7116288 1.3736622 3
U28 U28 1.1332858 1.5408897 3
U30 U30 1.3144806 1.3802906 3

Clustering of Items Using the 2PL Model and K-Means (k = 3)

Items were grouped based on difficulty and discrimination parameters using the Two-Parameter Logistic (2PL) model and K-Means clustering with k = 3. The 2PL model accounts for differences in item discrimination, resulting in a more accurate psychometric evaluation.

Cluster 1 contains the easiest items (difficulty from -2.20 to -0.93), including the low-discrimination items U6 and U18, which separate respondents less sharply. Cluster 2 contains items of moderate difficulty (-0.76 to 0.14), including the two most discriminating items (U15, U16), while Cluster 3 contains the most difficult items (0.54 to 1.65; e.g., U2, U30), most with discrimination above 1.2. This clustering approach enables balanced item sampling across psychometric profiles.

Conclusion: Psychometric Equivalence via Clustering

The analysis grouped the 30 items by difficulty and discrimination to support psychometric equivalence across test forms. Using K-Means clustering, with the number of clusters chosen by the Elbow and Silhouette methods, the items were divided into three clusters. Drawing items from each cluster helps make test forms fair and consistent, covering a wide range of ability levels while maintaining clear distinctions between performance levels. Overall, this approach supports the creation of psychometrically reliable and valid tests.
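One way to put the clusters to use is stratified sampling: each test form draws the same number of items from each cluster. The original does not specify how forms are assembled, so the sketch below is a hypothetical illustration; the function name, the allocation of 3, 5, and 6 items per cluster, and the random seed are all assumptions.

```python
import random

# Cluster membership copied from the cluster table above.
CLUSTERS = {
    1: ["U6", "U10", "U11", "U13", "U18", "U21", "U29"],
    2: ["U1", "U3", "U4", "U9", "U15", "U16", "U17", "U19", "U20", "U22", "U27"],
    3: ["U2", "U5", "U7", "U8", "U12", "U14", "U23", "U24", "U25", "U26", "U28", "U30"],
}


def build_parallel_forms(clusters, per_cluster, seed=0):
    """Draw two non-overlapping forms with the same number of items per cluster."""
    rng = random.Random(seed)
    form_a, form_b = [], []
    for cluster_id, items in sorted(clusters.items()):
        n = per_cluster[cluster_id]
        if 2 * n > len(items):
            raise ValueError(f"cluster {cluster_id} has too few items for two forms")
        drawn = rng.sample(items, 2 * n)  # sample without replacement
        form_a.extend(drawn[:n])
        form_b.extend(drawn[n:])
    return form_a, form_b


# Hypothetical allocation: 3, 5, and 6 items from clusters 1, 2, and 3.
allocation = {1: 3, 2: 5, 3: 6}
form_a, form_b = build_parallel_forms(CLUSTERS, allocation)

print("Form A:", form_a)
print("Form B:", form_b)
```

Matching cluster counts balances the forms only approximately. Comparing the mean difficulty and discrimination of the resulting forms would be a sensible follow-up check.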