Psychometric Analyses of LLM Benchmark Data

Author

Tom, Aron, Julius

Notes

## global variables: 
# only if needed

Theory

Lines of argument

Contradictions in the fields of Psychology <-> Machine Learning:

From my perspective, particularly within specific sub-disciplines of Psychology, the primary objective of statistical methods is to elucidate psychological mechanisms. For instance, in Experimental Psychology (Allgemeine Psychologie), researchers manipulate specific experimental conditions (independent variables) and measure the effects of these manipulations, ideally within randomized controlled trials (RCTs). Disciplines like Machine Learning, however, prioritize prediction. Despite these differences, both fields can benefit from one another, as highlighted by Yarkoni and Westfall (2017):

Psychology has historically been concerned, first and foremost, with explaining the causal mechanisms that give rise to behavior. Randomized, tightly controlled experiments are enshrined as the gold standard of psychological research, and there are endless investigations of the various mediating and moderating variables that govern various behaviors. We argue that psychology’s near-total focus on explaining the causes of behavior has led much of the field to be populated by research programs that provide intricate theories of psychological mechanism but that have little (or unknown) ability to predict future behaviors with any appreciable accuracy. We propose that principles and techniques from the field of machine learning can help psychology become a more predictive science. We review some of the fundamental concepts and tools of machine learning and point out examples where these concepts have been used to conduct interesting and important psychological research that focuses on predictive research questions. We suggest that an increased focus on prediction, rather than explanation, can ultimately lead us to greater understanding of behavior.

However, an intense focus on prediction - especially when applying “black box” models - could spark criticism within the field. Potential critiques include:

  • Lack of Interpretability: Black box models often provide little to no insight into the underlying mechanisms driving predictions, which contradicts the psychological emphasis on understanding (causal) relationships.
  • Theoretical Disconnect: Machine learning models may not align with (or take account of) established psychological theories, leading to a fragmentation between predictive accuracy and theoretical coherence.
  • Ethical Concerns: The deployment of predictive models without a clear understanding of their decision-making processes can lead to ethical dilemmas, particularly in areas such as clinical psychology.

Combining methods from multiple fields, it is possible to adopt a more integrative approach when investigating research questions and analyzing data: one can categorize types of research questions and their corresponding statistical models (e.g., EFA for exploration, ANOVA in the context of an RCT to discover a “causal” effect, machine learning to maximize prediction, cluster analyses for identifying homogeneous subgroups, …), and every method can give important insights:

Data Analysis Flowchart by Leek and Peng (2015)

Contrary to the perception of data analysis as a linear process, it is inherently a highly iterative process where insights gained at each step lead to re-evaluating and refining previous steps before moving forward. This cyclical approach ensures continuous learning and adjustment, in contrast to a seemingly straightforward, linear application of specific statistical models like a cooking recipe:

Epicycles of Analysis by Peng and Matsui (2016)

To summarize, in my opinion several factors may contribute to the (at least implicit, often pragmatically driven) rejection of methodologies from other fields and their respective “research philosophies”. These factors reflect both structural and cultural pressures within academia:

  • Pressure and Working Conditions in Academia: Researchers, particularly early-career scientists, face significant pressure to “publish or perish,” often under challenging working conditions. For example:
    • In Germany, as reported by Konsortium Bundesbericht Wissenschaftlicher Nachwuchs (2021), approximately 90% of early-career researchers are employed on fixed-term contracts, with PhD students averaging contract lengths of only 22 months. Further, despite a strong desire to start families, particularly among women, career uncertainties, poor work-life balance, and financial instability are mentioned as key reasons for the high rate of childlessness among young female researchers (see also Sengewald et al. 2024).
    • Internationally, similar challenges have been highlighted in recent work, such as Rahal et al. (2023), which discusses ways to redesign academic systems, emphasizing the role of permanent employment.
  • Conservatism in Methodology: A common response to the pressure to “publish or perish” is to adhere to familiar statistical models, technological tools, and the prevailing working styles of one’s research group. This methodological conservatism helps reduce the risks associated with exploring unfamiliar or cross-disciplinary approaches, but it may limit innovation and interdisciplinary collaboration. Besides limited contracts and insecure working conditions, this methodological rigidity could also be explained by multiple sociological theories:
    • Bourdieu’s Theory of Fields suggests that the academic field operates under specific power dynamics, where certain styles of working gain dominance based on their alignment with the interests of those in power (in the center of the field). Further, academics feel pressure to conform to established norms within their discipline (Fligstein and McAdam 2015); see also the YouTube video “Field theory - Pierre Bourdieu”
    • Luhmann’s Theory of the Differentiation of Social Systems posits that academia, like other social systems, differentiates itself through specialized methodologies that reinforce the boundaries between disciplines (Luhmann 1987)
    • Historical factors, including power-related dynamics, also play a role. For instance, Wallerstein’s work on European Universalism discusses how power structures have historically influenced which ideas and methodologies become dominant in different fields (Wallerstein 2006)

brief theory of latent variable models

In the context of answering survey questions, the Cognitive Aspects of Survey Methodology (also called the “Optimizing-Satisficing Model”) tries to explain how people finally arrive at a response, which is a great heuristic for identifying possible sources of error (Tourangeau, Rips, and Rasinski 2000; Moosbrugger and Kelava 2020):

Cognitive Aspects of Survey Methodology

Despite not knowing the exact processes behind answering a question (black box), we fundamentally assume that the answer/report depends on a score on a latent variable:


the central task of test theory is to determine the relationship between test behaviour and the (psychological) characteristic to be assessed

Latent variable definition: random variables whose realized values are hidden (Bollen 2002; Borsboom 2008)

A possible operationalisation of a latent variable is a linear measurement model with the equation Y = \lambda \eta + \epsilon.

basic measurement model

this corresponds to the fundamental equation of Classical Test Theory: Y = T + \epsilon

\rightarrow models which deal with latent variables are called Latent Variable Models (Skrondal and Rabe-Hesketh 2004, 2007), whereby the central aim of latent variable models is to infer unobservable (latent) psychological traits or abilities from responses to test items

Latent variable models Skrondal and Rabe-Hesketh (2007)
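To make the linear measurement model concrete, here is a minimal simulation sketch (simulated data; the loadings are chosen arbitrarily for illustration, and the psych package loaded below is assumed); a one-factor EFA should approximately recover the loadings \lambda:

set.seed(1)
n <- 500
eta <- rnorm(n)                         # latent variable scores
lambda <- c(.8, .7, .6, .5)             # loadings, arbitrary values for illustration
Y <- sapply(lambda, function(l) l * eta + rnorm(n, sd = sqrt(1 - l^2)))  # Y = lambda * eta + epsilon
psych::fa(Y, nfactors = 1)$loadings     # estimated loadings, approximately lambda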

theoretical aspects “What is a Latent Variable?”

  • the theoretical status of latent variables has not been clarified (Schurig 2017)
    • are these variables representations of real entities or just useful inventions?
    • Advantage: the use of latent variables allows more generalisable reasoning than manifest variables
  • latent variables need a substantive scientific foundation, whereby the bridging problem between observed and latent variable must be solved by theoretical assumptions and statistical modelling

Central to latent variable models is the local independence assumption: once the latent variable (e.g., a psychological trait or ability) is accounted for, the observed variables (such as responses to test items) are statistically independent of each other. This means that any correlation between the observed variables is fully explained by the latent variable, and no further direct relationships exist between the observed variables. In essence, the latent variable “absorbs” the shared variance, allowing the model to “get rid” of any inter-dependencies among the observed variables, simplifying the analysis (Skrondal and Rabe-Hesketh 2007).

Pr(y_j \mid \eta_j) = \prod_{i=1}^{n} Pr(y_{ij} \mid \eta_j)
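As a sketch of how local independence can be probed empirically (simulated data, assuming the mirt package loaded below; Q3 values near zero are consistent with local independence):

set.seed(1)
sim_dat <- mirt::simdata(a = matrix(1, 10, 1), d = rnorm(10), N = 500, itemtype = 'dich')  # Rasch-type data
sim_fit <- mirt::mirt(sim_dat, model = 1, itemtype = "Rasch", verbose = FALSE)
residuals(sim_fit, type = "Q3")  # pairwise Q3 statistics; large absolute values flag local dependence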

This assumption does not hold in more complex data sets (e.g., multi-dimensionality, response styles, …); in more complex models (e.g., the bifactor model), the variance of the indicators of single measurement models (CFAs) is divided into common and unique variance:

Variance splitting in CFAs

The variance shared between the indicators is the communality; the remaining variance is the unique variance, which is divided into indicator-specific method variance (specific) and measurement error variance (error).


quality criteria of measurements

Measurement and test administration should be carried out taking into account three central quality criteria of tests, which build on each other (no reliable measurement is possible without objectivity, etc.):

\text{objectivity} \rightarrow \text{reliability} \rightarrow \text{validity}

  • Objectivity: Test score is objective if it is independent of any influences outside the tested person (e.g., situational conditions, experimenter – all exogenous variables whose covariance structure is not explained by the statistical model, including error terms, unobserved influencing variables, and exogenous latent constructs)
    • Implementation objectivity (Durchführungsobjektivität): Standardization of implementation conditions (writing a test manual, training test leader, standardization of all other conditions).
    • Objectivity of evaluation (Auswertungsobjektivität): The interpretation of the test result is not dependent on the person who evaluates the test (measurable by inter-rater reliability, such as Kendall’s coefficient of concordance).
    • Objectivity of interpretation (Interpretationsobjektivität): Different test users come to the same conclusions with identical test scores.
  • Reliability: The degree of accuracy with which a test measures what it measures (whatever that may be). The focus here is on measurement precision. Reliability is demonstrated theoretically by the fact that repeated measurements under the same conditions produce the same measurement results (the central contribution to the development of reliability measurement is made by classical test theory, which establishes a theory of measurement error).
    • Reliability can be estimated by different methods, often as a measure of internal consistency; Cronbach’s Alpha (a measure of how items in a scale correlate with one another) is commonly used.

Classical test theory assumes that the test performance of a person on question i decomposes as x_{i} = \tau_{i} + \epsilon_{i}: the observed item response x_{i} is composed of the person’s true score \tau_{i} on question i and the error \epsilon_{i}, where the error is unbiased (if there are systematic aspects in the errors, apply models which can do variance splitting).
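A minimal sketch of this decomposition and of an internal-consistency estimate (simulated parallel items; the psych package is assumed):

set.seed(1)
n <- 300
tau <- rnorm(n)                                 # true scores
X <- sapply(1:5, function(i) tau + rnorm(n))    # five parallel items: x_i = tau + epsilon_i
psych::alpha(as.data.frame(X))$total$raw_alpha  # Cronbach's Alpha of the simulated scale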

  • Validity: A test is considered valid if it actually measures the characteristic it is supposed to measure and not some other characteristic. The measurement of validity is done in two steps:
    • Via structure-searching (such as exploratory factor analysis) and structure-testing (such as confirmatory factor analysis) procedures, construct validity is determined. This indicates the extent to which conclusions can be drawn from test results, for example, about psychological personality traits.
    • The agreement of the results of the individual constructs should be high with constructs that measure the same or similar characteristics (convergent validity), and the agreement with results from constructs that measure other characteristics should be low (discriminant validity). This can be analyzed via (latent) correlations, see also construct validity (Cronbach and Meehl 1955)

Importantly, more recently there is also an argument-based approach to validation: to validate an interpretation or use of measurements is to evaluate the rationale, or argument, for the proposed conclusions and decisions … Ultimately, the need for validation derives from the scientific and social requirement that public claims and decisions need to be justified:

  • interpretive argument: specifies the proposed interpretations and applications of assessment results by laying out a network of inferences and assumptions leading from the observed performances to the conclusions and decisions based on the assessment scores
  • validity argument: provides an evaluation of the interpretive argument’s coherence and the plausibility of its inferences and assumptions

A variety of further quality criteria of indicators were developed by the “Key National Indicators Initiative” (outdated, ~ 2005):

Key National Indicators Initiative, by Groves and Lyberg (2010)

To reflect on all possible factors influencing the quality of a survey, there is also the concept of the Total Survey Error (see Biemer et al. 2017; Groves and Lyberg 2010).

assumptions 1PL, 2PL model

Before running any statistical model, its assumptions are normally tested (at least if the model is sensitive to the violation of a specific assumption); for the Rasch model (1PL) we have the following assumptions (from Julius’s master’s thesis):

  • Unidimensionality: The solution probability depends on only one latent trait and is determined by the model parameters \theta_v, \beta_i. Apart from the model parameters, no further influencing variables \varphi are present: P(X_{vi} = 1 \mid \theta_v, \beta_i, \varphi) = P(X_{vi} = 1 \mid \theta_v, \beta_i)
  • Local stochastic independence: If the person ability \theta_v is held constant at a value, the correlation between every possible item pair X_{vi}, X_{vj} in the test vanishes (where i \neq j): X_{vi} \perp X_{vj} \mid \theta_v, \forall \,\, i, j
  • Sufficiency of the sum scores: The sum scores R_v = \sum_{i=1}^{k} X_{vi} of a test of length k are sufficient for estimating the person ability \theta_v. The same holds analogously for the item scores C_i = \sum_{v=1}^{n} X_{vi}
  • Monotonicity: The solution probability of an item x_{vi} increases monotonically with higher person ability values \theta: the more able a person is, the more likely she is to answer an item correctly. Expressed via the ICC f(x_{vi} \mid \theta_v, \beta_i): \theta_v > \theta_w: f(x_{vi} \mid \theta_v, \beta_i) > f(x_{wi} \mid \theta_w, \beta_i), \forall \,\, \theta_v, \theta_w

In the 2PL model, each item additionally has its own discrimination parameter, allowing some items to be better at differentiating between individuals with slightly different abilities.
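Formally, the item response function of the 2PL model is P(X_{vi} = 1 \mid \theta_v) = \frac{\exp(\alpha_i (\theta_v - \beta_i))}{1 + \exp(\alpha_i (\theta_v - \beta_i))}, with item discrimination \alpha_i and item difficulty \beta_i; the 1PL (Rasch) model is the special case in which \alpha_i = 1 for all items.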


If now the 1PL or the 2PL does not fit, this could have multiple reasons:

  • Multidimensionality: items are influenced by more than one latent trait
  • Poorly fitting items or data: some items behave in an unexpected or erratic manner; an item may not exhibit the expected increasing probability of a correct response as ability increases, especially if the item discrimination parameter \alpha_i is estimated to be negative (monotonicity violated)
  • Ceiling or floor effects: if the test contains items that are either too easy or too difficult for the population being measured, the probability of a correct response might not vary meaningfully with ability over a range of \theta values

load packages, data

# sets the directory of location of this script as the current directory
# setwd(dirname(rstudioapi::getSourceEditorContext()$path)) # not needed in Quarto env.

### load packages
#> This function is a wrapper for library and require. It checks to see if a package is installed, if not it attempts to install the package

require(pacman)
p_load('mirt', 'ggplot2', 'dplyr', 'tidyr', 'parallel', 'psych',
       'rpart', 'rpart.plot')

### load, prepare data
setwd("data")

#> long format
data_long <- read.csv("mmlu_data.csv")
data_long <- select(data_long, -"model")

#> remove items
items_to_remove <- data_long %>%
  group_by(doc_id) %>%
  summarize(mean = mean(acc), variance = var(acc)) 
# any items which are solved by every LLM?
cat("number of items solved by all LLMs:", sum(items_to_remove$mean == 1), "\n")
number of items solved by all LLMs: 0 
items_to_remove <- items_to_remove %>%
  filter(variance == 0) %>% 
  select(doc_id) %>%
  unlist() %>%
  as.character()

data_long <- data_long %>%
  filter(!doc_id %in% items_to_remove)

cat("number of items which have been removed:", length(items_to_remove), "\n")
number of items which have been removed: 54 
#> wide format
data_wide <- spread(data_long, doc_id, acc)
# data_wide <- select(data_wide, -"model_id")

making one suspicious

1PL, 2PL over subsample

over some sets of items the 2PL model does converge:

! but, the models show

  • bad item fit statistics:
    • S_{X2}: This is the test statistic for the S_{X2} test. It compares observed and expected item response patterns under the model, a high p-value (above 0.05) suggests that the item fits the model well.
    • poor Item Fit (p-values < 0.05) for items 8, 27, 13, 22, 3
    • item 3 has missing values (NaN) for the S_{X2} test and its degrees of freedom. This could indicate a lack of variability in the responses or issues with the data for this specific item.
  • too large item difficulties and / or item discrimination parameters
### sub-sample items
set.seed(12345) # to replicate results

setSize <- 30

random_items <- sample(x = 2:ncol(data_wide), size = setSize, replace = FALSE) # first variable is the model id
sub_dat <- data_wide[, random_items]

# Fit 1PL model (Rasch model)
fit1PLa <- mirt(data = sub_dat,
               model = 1,  # one-dimensional model
               itemtype = "Rasch",
               verbose = FALSE)

# Fit 2PL model
fit2PLa <- mirt(data = sub_dat,
               model = 1,  #one-dimensional
               itemtype = "2PL",
               technical = list(NCYCLES = 2000, SEtol = 1e-4),
               verbose = FALSE)


### estimated params
item_parameters_1PL <- coef(fit1PLa, IRTpars = TRUE, simplify = TRUE)$items
item_parameters_2PL <- coef(fit2PLa, IRTpars = TRUE, simplify = TRUE)$items

# item difficulty param "b", item discrimination param "a"
#> 1PL
psych::describe(x = item_parameters_1PL[,c("b", "a")])
  vars  n mean   sd median trimmed mad   min  max range skew kurtosis   se
b    1 30 1.55 1.55   1.65    1.56 1.5 -1.32 4.45  5.77 0.04    -0.83 0.28
a    2 30 1.00 0.00   1.00    1.00 0.0  1.00 1.00  0.00  NaN      NaN 0.00
#> 2PL
psych::describe(x = item_parameters_2PL[,c("b", "a")])
  vars  n mean   sd median trimmed  mad    min   max range  skew kurtosis   se
b    1 30 0.08 4.68   0.52    0.47 3.11 -14.87  9.47 24.34 -0.97     1.82 0.85
a    2 30 1.54 3.66   0.71    0.74 1.29  -1.45 14.67 16.13  2.97     7.87 0.67
### item fit statistics
# Calculate item fit statistics for the 2PL model
item_fit <- itemfit(fit2PLa)
item_fit[order(item_fit$S_X2),]
    item   S_X2 df.S_X2 RMSEA.S_X2 p.S_X2
5   4939  1.022       2      0.000  0.600
10  9291  1.383       2      0.000  0.501
16  9201  2.270       1      0.078  0.132
7   2271  2.414       3      0.000  0.491
18  2827  2.590       3      0.000  0.459
23  7817  5.709      10      0.000  0.839
27  2117  5.832       8      0.000  0.666
11  9091  6.009       4      0.049  0.198
6    604  6.616       3      0.076  0.085
24 10145  7.234      10      0.000  0.703
4  10950  7.438       7      0.017  0.385
28 11796 11.206       6      0.065  0.082
8  10027 11.949      11      0.020  0.367
19   392 12.247      11      0.023  0.345
22  7344 12.721       9      0.045  0.176
2    720 13.275      12      0.023  0.349
9     74 13.325       9      0.048  0.148
17  2896 13.995       8      0.060  0.082
3   8959 13.997      12      0.028  0.301
21  7905 14.517       8      0.063  0.069
1   8278 15.120       9      0.057  0.088
26  8821 16.668      10      0.057  0.082
30  1527 18.519       9      0.071  0.030
20 11970 19.683       8      0.084  0.012
12 10035 20.697      12      0.059  0.055
13 11781 22.033       9      0.084  0.009
25  7727 25.327       9      0.094  0.003
14  5048 31.517      11      0.095  0.001
15  9511 33.300      11      0.099  0.000
29 11670    NaN       0        NaN    NaN
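to pull the flagged items (p.S_X2 < .05) directly from the item_fit object above:

subset(item_fit, p.S_X2 < .05)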

and over some sets of items the 2PL model does not converge:

########
# sub-sample items
########
set.seed(11111) # to replicate results

setSize <- 30

random_items <- sample(x = 2:ncol(data_wide), size = setSize, replace = FALSE) # first variable is the model id
sub_dat <- data_wide[, random_items]

# Fit 1PL model (Rasch model)
fit1PLa <- mirt(data = sub_dat,
               model = 1,  # one-dimensional model
               itemtype = "Rasch",
               verbose = FALSE)

# Fit 2PL model
fit2PLa <- mirt(data = sub_dat,
               model = 1,  #one-dimensional
               itemtype = "2PL",
               technical = list(NCYCLES = 2000, SEtol = 1e-4),
               verbose = FALSE)
EM cycles terminated after 2000 iterations.

check assumptions of such models

check for unidimensionality

to check for the unidimensionality of the data, we can compute the correlations between the items using the Phi coefficient, followed by an EFA, whereby parallel analysis is applied to determine the number of factors (see Auerswald and Moshagen 2019 for recommendations on such analyses)

use only a subset of the data; the first variable is the model id:

cor_matrix <- cor(data_wide[,c(2:20, 70:85)]) # simply use Phi coefficient

psych::corPlot(r = cor_matrix)

efa_parallel <- psych::fa.parallel(cor_matrix, fa = "fa", n.obs = nrow(data_wide), cor = "cor")

Parallel analysis suggests that the number of factors =  8  and the number of components =  NA 
efa_results <- psych::fa(cor_matrix, nfactors = efa_parallel$nfact, fm = "pa")
efa_results
Factor Analysis using method =  pa
Call: psych::fa(r = cor_matrix, nfactors = efa_parallel$nfact, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
     PA1   PA2   PA3   PA5   PA4   PA6   PA7   PA8    h2   u2 com
0   0.04  0.50  0.00  0.11  0.09  0.34 -0.17 -0.03 0.588 0.41 2.3
1   0.36  0.51 -0.05  0.04 -0.12  0.07  0.11  0.06 0.562 0.44 2.2
2   0.32  0.30  0.07 -0.01  0.02  0.15 -0.08  0.08 0.380 0.62 2.8
3  -0.25 -0.04  0.14 -0.02  0.16  0.06  0.18  0.12 0.151 0.85 4.1
4  -0.04  0.74  0.13  0.08 -0.02 -0.11 -0.10  0.01 0.625 0.38 1.2
5   0.09  0.16 -0.15 -0.11 -0.65 -0.04 -0.01  0.03 0.468 0.53 1.4
6   0.72 -0.14  0.07  0.15 -0.14  0.04  0.10 -0.01 0.568 0.43 1.3
7   0.65  0.04 -0.06 -0.08 -0.05  0.10 -0.09  0.10 0.493 0.51 1.2
8   0.81  0.04 -0.02  0.05  0.01 -0.03  0.07  0.10 0.701 0.30 1.1
9   0.68  0.10  0.11 -0.11 -0.04  0.09 -0.04 -0.19 0.589 0.41 1.4
10  0.84 -0.05 -0.06  0.08  0.01 -0.11 -0.08 -0.09 0.685 0.31 1.1
11 -0.11 -0.02 -0.09 -0.01  0.03 -0.09 -0.20 -0.30 0.160 0.84 2.6
12  0.23  0.23  0.26  0.15  0.11  0.08 -0.37  0.26 0.678 0.32 5.1
13 -0.13 -0.01 -0.11 -0.01  0.07  0.09  0.08  0.11 0.055 0.94 5.2
14  0.05  0.41  0.13  0.18 -0.32  0.19  0.08 -0.08 0.410 0.59 3.2
15  0.63  0.20  0.00 -0.15  0.04 -0.01  0.04  0.17 0.556 0.44 1.5
16  0.18  0.03 -0.18 -0.02  0.04 -0.05  0.20 -0.02 0.102 0.90 3.3
17 -0.04  0.04 -0.13 -0.05 -0.10  0.00 -0.54 -0.06 0.303 0.70 1.3
18 -0.20  0.08  0.56 -0.01  0.02  0.07  0.09 -0.04 0.354 0.65 1.4
68 -0.01  0.36 -0.21 -0.18  0.41  0.10  0.14  0.04 0.385 0.61 3.4
69  0.00 -0.05 -0.04  0.26  0.01 -0.02  0.25  0.35 0.199 0.80 2.8
70  0.09  0.02  0.44 -0.23  0.05  0.02  0.30 -0.05 0.357 0.64 2.5
71 -0.04 -0.06 -0.14  0.09  0.00  0.63  0.01  0.00 0.391 0.61 1.2
72 -0.37  0.10  0.00 -0.01 -0.09 -0.18  0.00 -0.12 0.202 0.80 2.0
73 -0.03  0.13 -0.02 -0.17  0.05 -0.08  0.10  0.01 0.057 0.94 3.4
74  0.11  0.13 -0.14  0.33  0.28 -0.14  0.06 -0.32 0.390 0.61 4.4
75  0.07 -0.11  0.22 -0.16  0.25  0.29 -0.04 -0.06 0.230 0.77 4.1
76  0.47  0.10  0.10  0.01  0.11  0.07  0.14 -0.36 0.437 0.56 2.5
77  0.11  0.06 -0.02  0.03 -0.10 -0.09  0.27  0.15 0.110 0.89 2.7
78  0.52  0.05  0.07  0.06  0.19  0.21 -0.11  0.12 0.516 0.48 2.0
79  0.12  0.15 -0.19 -0.05  0.29 -0.15 -0.07  0.32 0.315 0.68 4.1
80 -0.04  0.22 -0.14  0.43  0.28 -0.12  0.01 -0.03 0.387 0.61 2.9
81  0.05  0.05  0.05  0.72 -0.01  0.11 -0.02  0.02 0.589 0.41 1.1
82  0.40  0.11  0.06  0.09  0.05  0.07 -0.10  0.21 0.361 0.64 2.2
83  0.09  0.09  0.63  0.08  0.04 -0.17 -0.07  0.03 0.470 0.53 1.3

                       PA1  PA2  PA3  PA5  PA4  PA6  PA7  PA8
SS loadings           4.82 2.14 1.41 1.31 1.20 1.06 1.00 0.89
Proportion Var        0.14 0.06 0.04 0.04 0.03 0.03 0.03 0.03
Cumulative Var        0.14 0.20 0.24 0.28 0.31 0.34 0.37 0.39
Proportion Explained  0.35 0.16 0.10 0.09 0.09 0.08 0.07 0.06
Cumulative Proportion 0.35 0.50 0.61 0.70 0.79 0.86 0.94 1.00

 With factor correlations of 
      PA1   PA2   PA3   PA5   PA4   PA6   PA7   PA8
PA1  1.00  0.38  0.13  0.19 -0.01  0.20 -0.06  0.09
PA2  0.38  1.00  0.13  0.21  0.12  0.18 -0.17  0.13
PA3  0.13  0.13  1.00  0.05 -0.01  0.13 -0.02 -0.01
PA5  0.19  0.21  0.05  1.00  0.09  0.05 -0.16 -0.01
PA4 -0.01  0.12 -0.01  0.09  1.00  0.03 -0.02  0.04
PA6  0.20  0.18  0.13  0.05  0.03  1.00 -0.06  0.06
PA7 -0.06 -0.17 -0.02 -0.16 -0.02 -0.06  1.00 -0.15
PA8  0.09  0.13 -0.01 -0.01  0.04  0.06 -0.15  1.00

Mean item complexity =  2.5
Test of the hypothesis that 8 factors are sufficient.

df null model =  595  with the objective function =  11.41
df of  the model are 343  and the objective function was  2.34 

The root mean square of the residuals (RMSR) is  0.04 
The df corrected root mean square of the residuals is  0.05 

Fit based upon off diagonal values = 0.96
Measures of factor score adequacy             
                                                   PA1  PA2  PA3  PA5  PA4  PA6
Correlation of (regression) scores with factors   0.96 0.90 0.84 0.84 0.82 0.80
Multiple R square of scores with factors          0.92 0.81 0.71 0.71 0.67 0.63
Minimum correlation of possible factor scores     0.83 0.61 0.41 0.43 0.35 0.27
                                                   PA7  PA8
Correlation of (regression) scores with factors   0.80 0.78
Multiple R square of scores with factors          0.63 0.60
Minimum correlation of possible factor scores     0.27 0.21
print(efa_results$loadings)

Loadings:
   PA1    PA2    PA3    PA5    PA4    PA6    PA7    PA8   
0          0.501         0.111         0.344 -0.172       
1   0.363  0.508               -0.124         0.107       
2   0.323  0.296                       0.148              
3  -0.254         0.141         0.161         0.182  0.122
4          0.744  0.133               -0.105              
5          0.161 -0.153 -0.110 -0.648                     
6   0.721 -0.143         0.151 -0.140                     
7   0.646                                                 
8   0.806                                                 
9   0.684         0.109 -0.107                      -0.186
10  0.842                             -0.108              
11 -0.114                                    -0.203 -0.300
12  0.234  0.227  0.263  0.148  0.114        -0.373  0.255
13 -0.127        -0.114                              0.106
14         0.415  0.125  0.178 -0.316  0.194              
15  0.628  0.201        -0.151                       0.165
16  0.182        -0.181                       0.204       
17               -0.128        -0.104        -0.535       
18 -0.205         0.564                                   
68         0.359 -0.213 -0.175  0.411         0.143       
69                       0.257                0.247  0.347
70                0.438 -0.227                0.305       
71               -0.137                0.632              
72 -0.370  0.104                      -0.183        -0.116
73         0.132        -0.171                0.102       
74  0.112  0.126 -0.141  0.332  0.283 -0.143        -0.325
75        -0.109  0.225 -0.158  0.254  0.288              
76  0.473  0.103                0.111         0.135 -0.363
77  0.107                                     0.270  0.146
78  0.515                       0.188  0.213 -0.107  0.116
79  0.116  0.147 -0.191         0.292 -0.148         0.322
80         0.222 -0.143  0.426  0.280 -0.120              
81                       0.720         0.107              
82  0.395  0.109                             -0.104  0.213
83                0.634               -0.166              

                 PA1   PA2   PA3   PA5   PA4   PA6   PA7   PA8
SS loadings    4.477 1.786 1.375 1.204 1.179 0.953 0.965 0.857
Proportion Var 0.128 0.051 0.039 0.034 0.034 0.027 0.028 0.024
Cumulative Var 0.128 0.179 0.218 0.253 0.286 0.314 0.341 0.366

num items to num persons

  • small simulation study
  • unstable if items >> persons
# see: https://www.r-bloggers.com/2012/12/simple-data-simulator-for-the-2pl-model/

twopl.sim         <- function( nitem = 20, npers = 100 ) {

  i.loc         <- rnorm( n = nitem, mean = 0, sd = 1 )  # item locations (difficulties)
  p.loc         <- rnorm( n = npers, mean = 0, sd = 1 )  # person locations (abilities)
  i.slp         <- rlnorm( nitem, sdlog = .4 )           # item slopes (discriminations)

  # one row per person, one column per item
  temp          <- matrix( rep( p.loc, length( i.loc ) ), ncol = length( i.loc ) )

  # logit_vi = i.slp_i * ( p.loc_v - i.loc_i )
  logits        <- t( apply( temp  , 1, '-', i.loc) )
  logits        <- t( apply( logits, 1, '*', i.slp) )

  probabilities <- 1 / ( 1 + exp( -logits ) )            # inverse logit

  resp.prob     <- matrix( probabilities, ncol = nitem)

  # Bernoulli draws from the model-implied probabilities
  obs.resp      <- matrix( sapply( c(resp.prob), rbinom, n = 1, size = 1), ncol = length(i.loc) )

  output        <- list()
  output$i.loc  <- i.loc
  output$i.slp  <- i.slp
  output$p.loc  <- p.loc
  output$resp   <- obs.resp

  output
}

i << n (fewer items than persons/LLMs):

set.seed(123)
data2pl <- twopl.sim(nitem=300,npers=3000)

# Track time to fit 2PL model
start_time <- Sys.time()

fit2PL <- mirt(data = as.data.frame(data2pl$resp),
               model = 1,  # one-dimensional
               itemtype = "2PL", verbose = FALSE)

end_time <- Sys.time()
fit_time <- end_time - start_time

# Print the time taken to fit the model
cat("Time taken to fit the 2PL model in seconds:", fit_time, "\n")
Time taken to fit the 2PL model in seconds: 9.550896 
item_parameters_2PL <- coef(fit2PL, IRTpars = TRUE, simplify = TRUE)$items

# item difficulty param
summary(item_parameters_2PL[,"b"])
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.21305 -0.61690 -0.04756  0.03039  0.64970  3.54400 
# item discrimination param
summary(item_parameters_2PL[,"a"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2471  0.7974  1.0109  1.0992  1.3207  2.7445 
layout(matrix(c(1,2),nrow=1))

plot(item_parameters_2PL[,"a"], data2pl$i.slp)
cor(item_parameters_2PL[,"a"], data2pl$i.slp)
[1] 0.991225
plot(item_parameters_2PL[,"b"], data2pl$i.loc)

cor(item_parameters_2PL[,"b"], data2pl$i.loc)
[1] 0.9976121

i >> n (more items than persons/LLMs):

  • here we have the ratio 4:1
set.seed(555)
data2pl <- twopl.sim(nitem=800,npers=200)

# Track time to fit 2PL model
start_time <- Sys.time()

fit2PL <- mirt(data = as.data.frame(data2pl$resp),
               model = 1,  # one-dimensional
               itemtype = "2PL", verbose = FALSE)

end_time <- Sys.time()
fit_time <- end_time - start_time

# Print the time taken to fit the model
cat("Time taken to fit the 2PL model in seconds:", fit_time, "\n")
Time taken to fit the 2PL model in seconds: 27.81824 
item_parameters_2PL <- coef(fit2PL, IRTpars = TRUE, simplify = TRUE)$items

# Item difficulty param
summary(item_parameters_2PL[,"b"])
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 -5.3721  -0.5704   0.1177   0.3485   0.8485 180.3183 
# Item discrimination param
summary(item_parameters_2PL[,"a"])
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002833 0.701107 0.965513 1.046822 1.285065 3.873584 
# Plots
layout(matrix(c(1,2),nrow=1))

plot(item_parameters_2PL[,"a"], data2pl$i.slp)
cor(item_parameters_2PL[,"a"], data2pl$i.slp)
[1] 0.88547
plot(item_parameters_2PL[,"b"], data2pl$i.loc)

cor(item_parameters_2PL[,"b"], data2pl$i.loc)
[1] 0.1923016

boil it down to a prediction task

Central goal(s)

  • identify a subset of items which strongly differentiate between LLMs
  • identify a subset of items which do not differentiate between LLMs (vice versa); a crude variance-based first pass is sketched after this list
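Before any modelling, a crude first pass on these goals is to rank the items by the variance of their accuracies across LLMs (a sketch reusing the data_long object from above; high-variance items separate the LLMs the most, zero-variance items not at all):

data_long %>%
  group_by(doc_id) %>%
  summarize(p_solved = mean(acc), variance = var(acc)) %>%
  arrange(desc(variance)) %>%
  head(10)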

simple Decision Trees

Idea motivated by Speck et al. (2017): the goal of their decision tree was to provide a classification to identify and distinguish between biology-derived and technology-derived developments. It aimed to rationalize the discussion about these developments by incorporating descriptive, normative, and emotional aspects, ultimately achieving an average accuracy of 90.0% in classifying the examples.

  • Probably we could get better results using the Iterative Dichotomiser 3 (ID3) algorithm (its two ingredients, entropy and information gain, are sketched after this list):
    • calculates the entropy of every attribute using the data set S,
    • splits the set S into subsets using the attribute for which entropy is minimum (or, equivalently, information gain is maximum),
    • makes a decision tree node containing that attribute, and
    • recurs on subsets using remaining attributes.
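A minimal base-R sketch of these two ingredients (the helper names are hypothetical, not rpart internals):

entropy <- function(y) {
  p <- prop.table(table(y))          # class proportions
  -sum(p * log2(p))                  # Shannon entropy
}

info_gain <- function(y, x) {        # y: class labels, x: a binary 0/1 item
  w <- prop.table(table(x))          # weights of the child nodes
  cond <- sum(sapply(names(w), function(v) w[[v]] * entropy(y[x == v])))
  entropy(y) - cond                  # parent entropy minus weighted child entropy
}

# ID3 would split on the attribute with maximum information gain and recurse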
# Create a CART model for classification
set.seed(123)  # for reproducibility
num_cols <- 501

sub_dat <- data_wide[, 1:num_cols]
colnames(sub_dat)[2:ncol(sub_dat)] <- paste0("V", 1:(ncol(sub_dat)-1))
sub_dat$model_id <- as.factor(sub_dat$model_id)  # factor outcome, so rpart fits a classification tree
model <- rpart(model_id ~ ., data = sub_dat, method = "class")
rpart.plot(model)

## cross-validation not working, error of predict function
## (likely cause: with a non-factor outcome rpart fits a regression tree, and
## predict(..., type = "class") then errors; converting model_id to a factor, as above, should fix this):
# train_indices <- sample(1:nrow(data_wide), 0.7 * nrow(data_wide))
# train_data <- data_wide[train_indices, 1:num_cols]
# colnames(train_data)[2:ncol(train_data)] <- paste0("V", 1:(ncol(train_data)-1))
# test_data <- data_wide[-train_indices, 1:num_cols]
# colnames(test_data)[2:ncol(test_data)] <- paste0("V", 1:(ncol(test_data)-1))
# predictions <- predict(model, newdata = test_data, type = "class") 

ridge regression, neural network?

  • would address the p >> n problem: in our case, we have fewer observations (LLMs) than predictors (see the sketch after this list)
    • adds a penalty to the regression, which shrinks the coefficients and helps in high-dimensional settings; the penalty parameter (lambda in glmnet) determines the strength of the regularization, with higher values leading to more shrinkage, while glmnet’s alpha mixes between ridge (alpha = 0) and lasso (alpha = 1)
    • necessary to cross-validate results (could also bootstrap as a sensitivity check)
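A minimal sketch of a cross-validated ridge fit in the p >> n regime (simulated data, not our MMLU pipeline; assumes the glmnet package is installed):

library(glmnet)
set.seed(1)
n <- 50; p <- 500                                # fewer observations than predictors
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1:10] %*% rnorm(10) + rnorm(n)          # only the first 10 predictors matter
cv_fit <- cv.glmnet(x, y, alpha = 0, nfolds = 5) # alpha = 0 is ridge; lambda is tuned by CV
coef(cv_fit, s = "lambda.min")[1:5, ]            # shrunken coefficients at the selected lambda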

Literature:

  • for regression models:
    • in R there is a package called glmnet, which fits generalized linear and similar models via penalized maximum likelihood, see: https://glmnet.stanford.edu/articles/glmnet.html
  • for Structural Equation Models: regularized SEM is implemented in the regsem package (Jacobucci 2017; Jacobucci, Brandmaier, and Kievit 2019; Li, Jacobucci, and Ammerman 2021)
  • for Item Response Theory there are IRTrees, which are also applied to discover response styles in survey data

Random stuff

cite literature in Quarto (rmarkdown)

  1. Blah blah (see Yarkoni and Westfall 2017, 33–35; also Speck et al. 2017, ch. 1).
  2. Blah blah (Yarkoni and Westfall 2017, 33–35).
  3. Blah blah (Yarkoni and Westfall 2017; Speck et al. 2017).
  4. Rutkowski et al. says blah (2017).
  5. Yarkoni and Westfall (2017) says blah.

Our ideas

  • Assignment of items to content areas: map the respective items in the MMLU to their content areas; there appear to be over 50 different ones (astronomy, math, history, etc.)
  • a possible approach to solving the estimation problems (simulation study) would be to obtain the “logits, probabilities (after softmax)” of the answers to the questions (A-E, …), in order to generate arbitrarily many responses for all LLMs (see the sketch below):
    • assume a probability vector for (A-C): [.11, .12, .31] \rightarrow take the sum: 0.11+0.12+0.31=0.54 \rightarrow divide each element by the sum: \frac{.11}{.54} = .2037 etc.; we then draw that many 0/1 responses
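A sketch of this renormalize-and-draw idea in R (probability values from the example above; the keyed answer “C” is a hypothetical assumption):

probs <- c(A = .11, B = .12, C = .31)   # softmax probabilities of the answer options
probs_norm <- probs / sum(probs)        # renormalized: .2037 .2222 .5741
set.seed(42)
draws <- sample(names(probs), size = 100, replace = TRUE, prob = probs_norm)
acc <- as.integer(draws == "C")         # 0/1 responses, assuming "C" is the keyed answer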

References

Auerswald, Max, and Morten Moshagen. 2019. “How to Determine the Number of Factors to Retain in Exploratory Factor Analysis: A Comparison of Extraction Methods Under Realistic Conditions.” Psychological Methods 24: 468–91. https://doi.org/10.1037/met0000200.
Biemer, Paul P., Edith D. de Leeuw, Stephanie Eckman, Brad Edwards, Frauke Kreuter, Lars E. Lyberg, N. Clyde Tucker, and Brady T. West. 2017. Total Survey Error in Practice. John Wiley & Sons.
Bollen, Kenneth A. 2002. “Latent Variables in Psychology and the Social Sciences.” Annual Review of Psychology 53 (January): 605–34. https://doi.org/10.1146/annurev.psych.53.100901.135239.
Borsboom, Denny. 2008. “Latent Variable Theory.” Measurement: Interdisciplinary Research and Perspectives 6 (1-2): 25–53. https://doi.org/10.1080/15366360802035497.
Cronbach, Lee J., and Paul E. Meehl. 1955. “Construct Validity in Psychological Tests.” Psychological Bulletin 52 (4): 281–302. https://doi.org/10.1037/h0040957.
Fligstein, Neil, and Doug McAdam. 2015. A Theory of Fields. Oxford University Press.
Groves, Robert M., and Lars Lyberg. 2010. “Total Survey Error: Past, Present, and Future.” Public Opinion Quarterly 74 (5): 849–79. https://doi.org/10.1093/poq/nfq065.
Jacobucci, Ross. 2017. “Regsem: Regularized Structural Equation Modeling.” arXiv. https://doi.org/10.48550/arXiv.1703.08489.
Jacobucci, Ross, Andreas M. Brandmaier, and Rogier A. Kievit. 2019. “A Practical Guide to Variable Selection in Structural Equation Modeling by Using Regularized Multiple-Indicators, Multiple-Causes Models.” Advances in Methods and Practices in Psychological Science 2 (1): 55–76. https://doi.org/10.1177/2515245919826527.
Konsortium Bundesbericht Wissenschaftlicher Nachwuchs. 2021. Bundesbericht Wissenschaftlicher Nachwuchs 2021. DE: wbv Media.
Leek, Jeffery T., and Roger D. Peng. 2015. “What Is the Question?” Science 347 (6228): 1314–15. https://doi.org/10.1126/science.aaa6146.
Li, Xiaobei, Ross Jacobucci, and Brooke A. Ammerman. 2021. “Tutorial on the Use of the Regsem Package in R.” Psych 3 (4): 579–92. https://doi.org/10.3390/psych3040038.
Luhmann, Niklas. 1987. Soziale Systeme: Grundriss Einer Allgemeinen Theorie. Suhrkamp.
Moosbrugger, Helfried, and Augustin Kelava, eds. 2020. Testtheorie und Fragebogenkonstruktion. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-662-61532-4.
Peng, Roger D., and Elizabeth Matsui. 2016. The Art of Data Science: A Guide for Anyone Who Works with Data. Lulu.com.
Rahal, Rima-Maria, Susann Fiedler, Adeyemi Adetula, Ronnie P.-A. Berntsson, Ulrich Dirnagl, Gordon B. Feld, Christian J. Fiebach, et al. 2023. “Quality Research Needs Good Working Conditions.” Nature Human Behaviour 7 (2): 164–67. https://doi.org/10.1038/s41562-022-01508-2.
Schurig, Michael. 2017. “Latente Variablenmodelle in der empirischen Bildungsforschung - die Schärfe und Struktur der Schatten an der Wand.” PhD thesis, TU Dortmund.
Sengewald, Marie-Ann, Mirka Henninger, Pia Bechtloff, and Veit Kubik. 2024. “Familiengerechte Karrieremöglichkeiten in Der Psychologischen Forschung?” Psychologische Rundschau 75 (3): 236–48. https://doi.org/10.1026/0033-3042/a000682.
Skrondal, Anders, and Sophia Rabe-Hesketh. 2004. Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. New York: Chapman and Hall/CRC. https://doi.org/10.1201/9780203489437.
———. 2007. “Latent Variable Modelling: A Survey*.” Scandinavian Journal of Statistics 34 (4): 712–45. https://doi.org/10.1111/j.1467-9469.2007.00573.x.
Speck, Olga, David Speck, Rafael Horn, Johannes Gantner, and Klaus Peter Sedlbauer. 2017. “Biomimetic Bio-Inspired Biomorph Sustainable? An Attempt to Classify and Clarify Biology-Derived Technical Developments.” Bioinspiration & Biomimetics 12 (1): 011004. https://doi.org/10.1088/1748-3190/12/1/011004.
Tourangeau, Roger, Lance J. Rips, and Kenneth Rasinski. 2000. The Psychology of Survey Response. Cambridge University Press.
Wallerstein, Immanuel Maurice. 2006. European Universalism: The Rhetoric of Power. New Press.
Yarkoni, Tal, and Jacob Westfall. 2017. “Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning.” Perspectives on Psychological Science 12 (6): 1100–1122. https://doi.org/10.1177/1745691617693393.