Psychometric Analyses of LLM Benchmark Data
Notes
Theory
Lines of argument
Contradictions in the fields of Psychology <-> Machine Learning:
From my perspective, particularly within specific sub-disciplines of Psychology, the primary objective of statistical methods is to elucidate psychological mechanisms. For instance, in Experimental Psychology (Allgemeine Psychologie), researchers manipulate specific experimental conditions (independent variables) and measure the effects of these manipulations, ideally within randomized controlled trials (RCTs). Disciplines like Machine Learning, however, prioritize prediction. Despite these differences, both fields can benefit from one another, as highlighted by Yarkoni and Westfall (2017).
Psychology has historically been concerned, first and foremost, with explaining the causal mechanisms that give rise to behavior. Randomized, tightly controlled experiments are enshrined as the gold standard of psychological research, and there are endless investigations of the various mediating and moderating variables that govern various behaviors. We argue that psychology’s near-total focus on explaining the causes of behavior has led much of the field to be populated by research programs that provide intricate theories of psychological mechanism but that have little (or unknown) ability to predict future behaviors with any appreciable accuracy. We propose that principles and techniques from the field of machine learning can help psychology become a more predictive science. We review some of the fundamental concepts and tools of machine learning and point out examples where these concepts have been used to conduct interesting and important psychological research that focuses on predictive research questions. We suggest that an increased focus on prediction, rather than explanation, can ultimately lead us to greater understanding of behavior.
However, an intense focus on prediction - especially when applying “black box” models - could spark criticism within the field. Potential critiques include:
- Lack of Interpretability: Black box models often provide little to no insight into the underlying mechanisms driving predictions, which contradicts the psychological emphasis on understanding (causal) relationships.
- Theoretical Disconnect: Machine learning models may not align with (or engage with) established psychological theories, leading to a fragmentation between predictive accuracy and theoretical coherence.
- Ethical Concerns: The deployment of predictive models without a clear understanding of their decision-making processes can lead to ethical dilemmas, particularly in areas such as clinical psychology.
- …
Combining methods from multiple fields makes it possible to adopt a more integrative approach when investigating research questions and analyzing data: one can categorize types of research questions and their corresponding statistical models (e.g., EFA for exploration, ANOVA in the context of an RCT to discover a "causal" effect, machine learning to maximize prediction, cluster analyses for identifying homogeneous subgroups, …), and every method can provide important insights:
Contrary to the perception of data analysis as a linear process, it is inherently a highly iterative process where insights gained at each step lead to re-evaluating and refining previous steps before moving forward. This cyclical approach ensures continuous learning and adjustment, in contrast to a seemingly straightforward, linear application of specific statistical models like a cooking recipe:
To summarize, in my opinion several factors may contribute to the (at least implicit, or often pragmatically driven) rejection of methodologies from other fields and their respective "research philosophies". These factors reflect both structural and cultural pressures within academia:
- Pressure and Working Conditions in Academia: Researchers, particularly early-career scientists, face significant pressure to “publish or perish,” often under challenging working conditions. For example:
- In Germany, as reported by Konsortium Bundesbericht Wissenschaftlicher Nachwuchs (2021), approximately 90% of early-career researchers are employed on fixed-term contracts, with PhD students averaging contract lengths of only 22 months. Furthermore, despite a strong desire to start families, particularly among women, career uncertainties, poor work-life balance, and financial instability are mentioned as key reasons for the high rate of childlessness among young female researchers (see also Sengewald et al. 2024)
- Internationally, similar challenges have been highlighted in recent work, such as Rahal et al. (2023), which discusses ways to redesign academic systems, emphasizing the role of permanent employment
- Conservatism in Methodology: A common response to the pressure to "publish or perish" is to adhere to familiar statistical models, technological tools, and the prevailing working styles of one's research group. This methodological conservatism helps reduce the risks associated with exploring unfamiliar or cross-disciplinary approaches, but it may limit innovation and interdisciplinary collaboration. Beyond limited contracts and insecure working conditions, this methodological rigidity could also be explained by multiple sociological theories:
- Bourdieu's Theory of Fields suggests that the academic field operates under specific power dynamics, where certain styles of working gain dominance based on their alignment with the interests of those in power (in the center of the field). Furthermore, academics feel pressure to conform to established norms within their discipline (Fligstein and McAdam 2015); see also the YouTube video "Field theory - Pierre Bourdieu"
- Luhmann’s Theory of the Differentiation of Social Systems posits that academia, like other social systems, differentiates itself through specialized methodologies that reinforce the boundaries between disciplines (Luhmann 1987)
- Historical factors, including power-related dynamics, also play a role. For instance, Wallerstein’s work on European Universalism discusses how power structures have historically influenced which ideas and methodologies become dominant in different fields (Wallerstein 2006)
brief theory of latent variable models
In the context of answering survey questions, the Cognitive Aspects of Survey Methodology (also called the "Optimizing-Satisficing Model") tries to explain how people finally arrive at a response, which is a great heuristic for identifying possible sources of error (Tourangeau, Rips, and Rasinski 2000; Moosbrugger and Kelava 2020):
Despite not knowing the exact processes of answering a question (black box), we fundamentally assume that the answer / reporting depends on a score on a latent variable:
the central task of test theory is to determine the relationship between test behaviour and the (psychological) characteristic to be assessed
Latent variable definition: random variables whose realized values are hidden (Bollen 2002; Borsboom 2008)
A possible operationalisation of a latent variable is a linear measurement model with the equation Y = \lambda \cdot \eta + \epsilon.
this corresponds to the fundamental equation of Classical Test Theory: Y = T + \epsilon
\rightarrow models which are dealing with latent variables are called Latent Variable Models (Skrondal and Rabe-Hesketh 2004, 2007), whereby the central aim of latent variable models is to infer unobservable (latent) psychological traits or abilities from responses to test items
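To make this concrete, a minimal simulated sketch (my own illustration, not part of the analyses below): generate data from Y = \lambda \cdot \eta + \epsilon for three indicators and recover the loadings with a one-factor EFA from the psych package.
set.seed(1)
n      <- 1000
eta    <- rnorm(n)              # latent variable
lambda <- c(0.8, 0.7, 0.6)      # true loadings
Y      <- sapply(lambda, function(l) l * eta + rnorm(n, sd = sqrt(1 - l^2)))
psych::fa(Y, nfactors = 1)$loadings # estimates should be close to lambda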
theoretical aspects “What is a Latent Variable?”
- the theoretical status of latent variables has not been clarified (Schurig 2017)
- are these variables representations of real entities or just useful inventions?
- Advantage: the use of latent variables allows more generalisable reasoning than manifest variables
- latent variables need a substantive scientific foundation, whereby the bridging problem between observed and latent variable must be solved by theoretical assumptions and statistical modelling
Central is the local independence assumption: in latent variable models it is assumed that, once the latent variable (e.g., a psychological trait or ability) is accounted for, the observed variables (such as responses to test items) are statistically independent of each other. This means that any correlation between the observed variables is fully explained by the latent variable, and no further direct relationships exist between the observed variables. In essence, the latent variable "absorbs" the shared variance, allowing the model to "get rid" of any inter-dependencies among the observed variables, simplifying the analysis (Skrondal and Rabe-Hesketh 2007).
Pr(y_j \mid \eta_j) = \prod_{i=1}^{n} Pr(y_{ij} \mid \eta_j)
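A minimal simulated sketch of this assumption (my own illustration, not part of the analyses below): two indicators that depend only on \eta correlate marginally, but the correlation vanishes once \eta is partialled out.
set.seed(2)
n   <- 5000
eta <- rnorm(n)               # latent variable
y1  <- 0.8 * eta + rnorm(n)   # both indicators depend only on eta
y2  <- 0.7 * eta + rnorm(n)
cor(y1, y2)                                   # clearly positive
cor(resid(lm(y1 ~ eta)), resid(lm(y2 ~ eta))) # ~ 0 after conditioning on eta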
The local independence assumption does not hold in more complex data sets (e.g., multidimensionality, response styles, …); in more complex models (e.g., the bifactor model), the variance of the indicators of single measurement models (CFAs) is divided into common and unique variance:
The variance shared between the indicators is the communality; the remaining variance is the unique variance, which is divided into indicator-specific method variance (specific) and measurement error variance (error).
quality criteria of measurements
Measurements and test administration should be carried out taking into account three central quality criteria of tests, which build on each other (no reliable measurement is possible without objectivity, etc.):
\text{objectivity} \rightarrow \text{reliability} \rightarrow \text{validity}
- Objectivity: Test score is objective if it is independent of any influences outside the tested person (e.g., situational conditions, experimenter – all exogenous variables whose covariance structure is not explained by the statistical model, including error terms, unobserved influencing variables, and exogenous latent constructs)
- Implementation objectivity (Durchführungsobjektivität): Standardization of implementation conditions (writing a test manual, training test leader, standardization of all other conditions).
- Objectivity of evaluation (Auswertungsobjektivität): The interpretation of the test result is not dependent on the person who evaluates the test (measurable by inter-rater reliability, such as Kendall’s coefficient of concordance).
- Objectivity of interpretation (Interpretationsobjektivität): Different test users come to the same conclusions with identical test scores.
- Reliability: The degree of precision (measurement accuracy) with which a test measures, largely regardless of what it measures. Reliability is demonstrated theoretically by the fact that repeated measurements under the same conditions produce the same measurement results (the central contribution to the development of reliability measurement is made by classical test theory, which establishes a theory of measurement error).
- Reliability can be estimated by different methods, often as a measure of internal consistency - Cronbach's Alpha (a measure of how items in a scale correlate with one another) is commonly used.
Classical test theory assumes that the test performance of a person on question i is composed as x_{i} = \tau_{i} + \epsilon_{i}: the item response x_{i} is composed of the person's true score \tau_{i} on question i and the error \epsilon_{i}, where the error is unbiased (if there are systematic aspects in the errors, apply models which can do variance splitting); a small simulated sketch follows below.
- Validity: A test is considered valid if it actually measures the characteristic it is supposed to measure and not some other characteristic. The measurement of validity is done in two steps:
- Via structure-searching (such as exploratory factor analysis) and structure-testing (such as confirmatory factor analysis) procedures, construct validity is determined. This indicates the extent to which conclusions can be drawn from test results, for example, about psychological personality traits.
- The agreement of the results of the individual constructs should be high with constructs that measure the same or similar characteristics (convergent validity), and the agreement with results from constructs that measure other characteristics should be low (discriminant validity). This can be analyzed via (latent) correlations, see also construct validity (Cronbach and Meehl 1955)
Importantly, more recently there is also an argument-based approach to validation. To validate an interpretation or use of measurements is to evaluate the rationale, or argument, for the proposed conclusions and decisions … Ultimately, the need for validation derives from the scientific and social requirement that public claims and decisions need to be justified:
- interpretive argument: specifies the proposed interpretations and applications of assessment results by laying out a network of inferences and assumptions leading from the observed performances to the conclusions and decisions based on the assessment scores
- validity argument: provides an evaluation of the interpretive argument’s coherence and the plausibility of its inferences and assumptions
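A minimal simulated sketch of the classical test theory decomposition mentioned above (my own illustration; five parallel items are assumed): the reliability of the sum score can be computed directly from the simulated true scores and is closely approximated by Cronbach's Alpha.
set.seed(3)
n <- 1000
true_score <- rnorm(n)                                  # tau
items <- sapply(1:5, function(i) true_score + rnorm(n)) # X = tau + epsilon, 5 parallel items
var(5 * true_score) / var(rowSums(items))          # sum-score reliability, theoretically 25/30 = .83
psych::alpha(as.data.frame(items))$total$raw_alpha # Cronbach's Alpha, also ~ .83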
A variety of further quality criteria of indicators were developed by the "Key National Indicators Initiative" (outdated, ~ 2005).
To reflect on all possible factors influencing the quality of a survey, there is also the concept of the Total Survey Error (see Biemer et al. 2017; Groves and Lyberg 2010).
assumptions 1PL, 2PL model
Before running any statistical model, its assumptions are normally tested - at least if the model is sensitive to the violation of a specific assumption; for the Rasch model (1PL) we have the following assumptions (from Julius' Master's thesis):
- Unidimensionality: The probability of solving an item depends only on one latent trait and is determined by the model parameters \theta_v, \beta_i. Apart from the model parameters, there are no further influencing variables \varphi: P(X_{vi} = 1 \mid \theta_v, \beta_i, \varphi) = P(X_{vi} = 1 \mid \theta_v, \beta_i)
- Local stochastic independence: If the person ability \theta_v is held constant at a value, the correlation between every possible item pair X_{vi}, X_{vj} in the test (where i \neq j) vanishes: X_{vi} \perp X_{vj} \mid \theta_v, \forall \,\, i \neq j
- Sufficiency of the sum scores: The sum scores R_v = \sum_{i=1}^{k} X_{vi} of a test of length k are sufficient for estimating the person ability \theta_v. The same holds analogously for the item scores C_i = \sum_{v=1}^{n} X_{vi}
- Monotonicity: The probability of solving an item x_{vi} increases monotonically with higher person ability values \theta. The more able a person is, the more likely they are to answer an item correctly. This is expressed in the ICC f(x_{vi} \mid \theta_v, \beta_i) as follows: \theta_v > \theta_w \Rightarrow f(x_{vi} \mid \theta_v, \beta_i) > f(x_{wi} \mid \theta_w, \beta_i), \forall \,\, \theta_v, \theta_w
In the 2PL model, each item has additionally its own discrimination parameter, allowing some items to be better at differentiating between individuals with slightly different abilities.
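A minimal sketch of the corresponding item characteristic curves (my own illustration): under the 2PL, P(X = 1 \mid \theta) = 1 / (1 + exp(-a(\theta - b))); a larger discrimination a makes the curve steeper around the difficulty b (the 1PL fixes a to a constant, e.g., 1).
theta <- seq(-4, 4, length.out = 200)
icc <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
plot(theta, icc(theta, a = 1, b = 0), type = "l", ylim = c(0, 1),
     xlab = expression(theta), ylab = "P(X = 1)")
lines(theta, icc(theta, a = 2.5, b = 0), lty = 2) # more discriminating item
lines(theta, icc(theta, a = 1, b = 1), lty = 3)   # more difficult item
legend("topleft", legend = c("a = 1, b = 0", "a = 2.5, b = 0", "a = 1, b = 1"), lty = 1:3)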
If the 1PL or 2PL model does not fit, this can have multiple reasons:
- Multidimensionality: items are influenced by more than one latent trait
- Poorly fitting items or data: some items behave in an unexpected or erratic manner; an item may not exhibit the expected increasing probability of a correct response as ability increases, especially if its item discrimination parameter \alpha_i is estimated to be negative (monotonicity violated; a small descriptive check is sketched after this list)
- Ceiling or floor effects: if the test contains items that are either too easy or too difficult for the population being measured, the probability of a correct response might not vary meaningfully with ability over a range of \theta values
- …
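A small sketch of how a monotonicity violation can be spotted descriptively (my own illustration with simulated data): the proportion of correct responses per ability group should increase for a well-behaved item.
set.seed(99)
n <- 500
theta <- rnorm(n)
good_item <- rbinom(n, 1, plogis(1.2 * theta))  # monotonically increasing in theta
bad_item  <- rbinom(n, 1, plogis(-0.8 * theta)) # negative discrimination
groups <- cut(theta, breaks = quantile(theta, 0:5 / 5), include.lowest = TRUE)
tapply(good_item, groups, mean) # increasing proportions
tapply(bad_item,  groups, mean) # decreasing proportions: monotonicity violated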
load packages, data
# sets the directory of location of this script as the current directory
# setwd(dirname(rstudioapi::getSourceEditorContext()$path)) # not needed in Quarto env.
### load packages
#> This function is a wrapper for library and require. It checks to see if a package is installed, if not it attempts to install the package
require(pacman)
p_load('mirt', 'ggplot2', 'dplyr', 'tidyr', 'parallel', 'psych',
       'rpart', 'rpart.plot')
### load, prepare data
setwd("data")
#> long format
data_long <- read.csv("mmlu_data.csv")
data_long <- select(data_long, -"model")
#> remove items
items_to_remove <- data_long %>%
  group_by(doc_id) %>%
  summarize(mean = mean(acc), variance = var(acc))
# any items which are solved by every LLM?
cat("number of items solved by all LLMs:", sum(items_to_remove$mean == 1), "\n")
number of items solved by all LLMs: 0
items_to_remove <- items_to_remove %>%
  filter(variance == 0) %>%
  select(doc_id) %>%
  unlist() %>%
  as.character()
data_long <- data_long %>%
  filter(!doc_id %in% items_to_remove)
cat("number of items which have been removed:", length(items_to_remove), "\n")
number of items which have been removed: 54
#> wide format
data_wide <- spread(data_long, doc_id, acc)
# data_wide <- select(data_wide, -"model_id")
1PL, 2PL over subsample
over some sets of items the 2PL model does converge:
but the model shows
- bad item fit statistics:
- S_{X2}: the test statistic of the S_{X2} item-fit test. It compares observed and expected item response patterns under the model; a high p-value (above 0.05) suggests that the item fits the model well.
- poor item fit (p-values < 0.05) for items 8, 27, 13, 22, 3
- item 3 has missing values (NaN) for the S_{X2} test and its degrees of freedom. This could indicate a lack of variability in the responses or issues with the data for this specific item.
- too large item difficulties and / or item discrimination parameters
### sub-sample items
set.seed(12345) # to replicate results
setSize <- 30

random_items <- sample(x = 2:ncol(data_wide), size = setSize, replace = FALSE) # first variable is the model id
sub_dat <- data_wide[, random_items]

# Fit 1PL model (Rasch model)
fit1PLa <- mirt(data = sub_dat,
                model = 1, # one-dimensional model
                itemtype = "Rasch",
                verbose = FALSE)
# Fit 2PL model
fit2PLa <- mirt(data = sub_dat,
                model = 1, # one-dimensional
                itemtype = "2PL",
                technical = list(NCYCLES = 2000, SEtol = 1e-4),
                verbose = FALSE)

### estimated params
item_parameters_1PL <- coef(fit1PLa, IRTpars = TRUE, simplify = TRUE)$items
item_parameters_2PL <- coef(fit2PLa, IRTpars = TRUE, simplify = TRUE)$items

# item difficulty param "b", item discrimination param "a"
#> 1PL
psych::describe(x = item_parameters_1PL[,c("b", "a")])
vars n mean sd median trimmed mad min max range skew kurtosis se
b 1 30 1.55 1.55 1.65 1.56 1.5 -1.32 4.45 5.77 0.04 -0.83 0.28
a 2 30 1.00 0.00 1.00 1.00 0.0 1.00 1.00 0.00 NaN NaN 0.00
#> 2PL
psych::describe(x = item_parameters_2PL[,c("b", "a")])
vars n mean sd median trimmed mad min max range skew kurtosis se
b 1 30 0.08 4.68 0.52 0.47 3.11 -14.87 9.47 24.34 -0.97 1.82 0.85
a 2 30 1.54 3.66 0.71 0.74 1.29 -1.45 14.67 16.13 2.97 7.87 0.67
### item fit statistics
# Calculate item fit statistics for the 2PL model
item_fit <- itemfit(fit2PLa)
item_fit[order(item_fit$S_X2), ]
item S_X2 df.S_X2 RMSEA.S_X2 p.S_X2
5 4939 1.022 2 0.000 0.600
10 9291 1.383 2 0.000 0.501
16 9201 2.270 1 0.078 0.132
7 2271 2.414 3 0.000 0.491
18 2827 2.590 3 0.000 0.459
23 7817 5.709 10 0.000 0.839
27 2117 5.832 8 0.000 0.666
11 9091 6.009 4 0.049 0.198
6 604 6.616 3 0.076 0.085
24 10145 7.234 10 0.000 0.703
4 10950 7.438 7 0.017 0.385
28 11796 11.206 6 0.065 0.082
8 10027 11.949 11 0.020 0.367
19 392 12.247 11 0.023 0.345
22 7344 12.721 9 0.045 0.176
2 720 13.275 12 0.023 0.349
9 74 13.325 9 0.048 0.148
17 2896 13.995 8 0.060 0.082
3 8959 13.997 12 0.028 0.301
21 7905 14.517 8 0.063 0.069
1 8278 15.120 9 0.057 0.088
26 8821 16.668 10 0.057 0.082
30 1527 18.519 9 0.071 0.030
20 11970 19.683 8 0.084 0.012
12 10035 20.697 12 0.059 0.055
13 11781 22.033 9 0.084 0.009
25 7727 25.327 9 0.094 0.003
14 5048 31.517 11 0.095 0.001
15 9511 33.300 11 0.099 0.000
29 11670 NaN 0 NaN NaN
and over some sets of items the 2PL model does not converge:
########
# sub-sample items
########
set.seed(11111) # to replicate results
setSize <- 30

random_items <- sample(x = 1:ncol(data_wide), size = setSize, replace = FALSE)
sub_dat <- data_wide[, random_items]

# Fit 1PL model (Rasch model)
fit1PLa <- mirt(data = sub_dat,
                model = 1, # one-dimensional model
                itemtype = "Rasch",
                verbose = FALSE)
# Fit 2PL model
fit2PLa <- mirt(data = sub_dat,
                model = 1, # one-dimensional
                itemtype = "2PL",
                technical = list(NCYCLES = 2000, SEtol = 1e-4),
                verbose = FALSE)
EM cycles terminated after 2000 iterations.
check assumptions of such models
check for unidimensionality
to check for the unidimensionality of the data, we can compute the correlations between the items using the Phi coefficient, followed by an EFA, whereby parallel analysis is applied to determine the number of factors (see recommendations for such analyses in Auerswald and Moshagen 2019)
use only a subset of the data; the first variable is the model id:
cor_matrix <- cor(data_wide[,c(2:20, 70:85)]) # simply use Phi coefficient

psych::corPlot(r = cor_matrix)
efa_parallel <- psych::fa.parallel(cor_matrix, fa = "fa", n.obs = nrow(data_wide), cor = "cor")
Parallel analysis suggests that the number of factors = 8 and the number of components = NA
efa_results <- psych::fa(cor_matrix, nfactors = efa_parallel$nfact, fm = "pa")
efa_results
Factor Analysis using method = pa
Call: psych::fa(r = cor_matrix, nfactors = efa_parallel$nfact, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 PA3 PA5 PA4 PA6 PA7 PA8 h2 u2 com
0 0.04 0.50 0.00 0.11 0.09 0.34 -0.17 -0.03 0.588 0.41 2.3
1 0.36 0.51 -0.05 0.04 -0.12 0.07 0.11 0.06 0.562 0.44 2.2
2 0.32 0.30 0.07 -0.01 0.02 0.15 -0.08 0.08 0.380 0.62 2.8
3 -0.25 -0.04 0.14 -0.02 0.16 0.06 0.18 0.12 0.151 0.85 4.1
4 -0.04 0.74 0.13 0.08 -0.02 -0.11 -0.10 0.01 0.625 0.38 1.2
5 0.09 0.16 -0.15 -0.11 -0.65 -0.04 -0.01 0.03 0.468 0.53 1.4
6 0.72 -0.14 0.07 0.15 -0.14 0.04 0.10 -0.01 0.568 0.43 1.3
7 0.65 0.04 -0.06 -0.08 -0.05 0.10 -0.09 0.10 0.493 0.51 1.2
8 0.81 0.04 -0.02 0.05 0.01 -0.03 0.07 0.10 0.701 0.30 1.1
9 0.68 0.10 0.11 -0.11 -0.04 0.09 -0.04 -0.19 0.589 0.41 1.4
10 0.84 -0.05 -0.06 0.08 0.01 -0.11 -0.08 -0.09 0.685 0.31 1.1
11 -0.11 -0.02 -0.09 -0.01 0.03 -0.09 -0.20 -0.30 0.160 0.84 2.6
12 0.23 0.23 0.26 0.15 0.11 0.08 -0.37 0.26 0.678 0.32 5.1
13 -0.13 -0.01 -0.11 -0.01 0.07 0.09 0.08 0.11 0.055 0.94 5.2
14 0.05 0.41 0.13 0.18 -0.32 0.19 0.08 -0.08 0.410 0.59 3.2
15 0.63 0.20 0.00 -0.15 0.04 -0.01 0.04 0.17 0.556 0.44 1.5
16 0.18 0.03 -0.18 -0.02 0.04 -0.05 0.20 -0.02 0.102 0.90 3.3
17 -0.04 0.04 -0.13 -0.05 -0.10 0.00 -0.54 -0.06 0.303 0.70 1.3
18 -0.20 0.08 0.56 -0.01 0.02 0.07 0.09 -0.04 0.354 0.65 1.4
68 -0.01 0.36 -0.21 -0.18 0.41 0.10 0.14 0.04 0.385 0.61 3.4
69 0.00 -0.05 -0.04 0.26 0.01 -0.02 0.25 0.35 0.199 0.80 2.8
70 0.09 0.02 0.44 -0.23 0.05 0.02 0.30 -0.05 0.357 0.64 2.5
71 -0.04 -0.06 -0.14 0.09 0.00 0.63 0.01 0.00 0.391 0.61 1.2
72 -0.37 0.10 0.00 -0.01 -0.09 -0.18 0.00 -0.12 0.202 0.80 2.0
73 -0.03 0.13 -0.02 -0.17 0.05 -0.08 0.10 0.01 0.057 0.94 3.4
74 0.11 0.13 -0.14 0.33 0.28 -0.14 0.06 -0.32 0.390 0.61 4.4
75 0.07 -0.11 0.22 -0.16 0.25 0.29 -0.04 -0.06 0.230 0.77 4.1
76 0.47 0.10 0.10 0.01 0.11 0.07 0.14 -0.36 0.437 0.56 2.5
77 0.11 0.06 -0.02 0.03 -0.10 -0.09 0.27 0.15 0.110 0.89 2.7
78 0.52 0.05 0.07 0.06 0.19 0.21 -0.11 0.12 0.516 0.48 2.0
79 0.12 0.15 -0.19 -0.05 0.29 -0.15 -0.07 0.32 0.315 0.68 4.1
80 -0.04 0.22 -0.14 0.43 0.28 -0.12 0.01 -0.03 0.387 0.61 2.9
81 0.05 0.05 0.05 0.72 -0.01 0.11 -0.02 0.02 0.589 0.41 1.1
82 0.40 0.11 0.06 0.09 0.05 0.07 -0.10 0.21 0.361 0.64 2.2
83 0.09 0.09 0.63 0.08 0.04 -0.17 -0.07 0.03 0.470 0.53 1.3
PA1 PA2 PA3 PA5 PA4 PA6 PA7 PA8
SS loadings 4.82 2.14 1.41 1.31 1.20 1.06 1.00 0.89
Proportion Var 0.14 0.06 0.04 0.04 0.03 0.03 0.03 0.03
Cumulative Var 0.14 0.20 0.24 0.28 0.31 0.34 0.37 0.39
Proportion Explained 0.35 0.16 0.10 0.09 0.09 0.08 0.07 0.06
Cumulative Proportion 0.35 0.50 0.61 0.70 0.79 0.86 0.94 1.00
With factor correlations of
PA1 PA2 PA3 PA5 PA4 PA6 PA7 PA8
PA1 1.00 0.38 0.13 0.19 -0.01 0.20 -0.06 0.09
PA2 0.38 1.00 0.13 0.21 0.12 0.18 -0.17 0.13
PA3 0.13 0.13 1.00 0.05 -0.01 0.13 -0.02 -0.01
PA5 0.19 0.21 0.05 1.00 0.09 0.05 -0.16 -0.01
PA4 -0.01 0.12 -0.01 0.09 1.00 0.03 -0.02 0.04
PA6 0.20 0.18 0.13 0.05 0.03 1.00 -0.06 0.06
PA7 -0.06 -0.17 -0.02 -0.16 -0.02 -0.06 1.00 -0.15
PA8 0.09 0.13 -0.01 -0.01 0.04 0.06 -0.15 1.00
Mean item complexity = 2.5
Test of the hypothesis that 8 factors are sufficient.
df null model = 595 with the objective function = 11.41
df of the model are 343 and the objective function was 2.34
The root mean square of the residuals (RMSR) is 0.04
The df corrected root mean square of the residuals is 0.05
Fit based upon off diagonal values = 0.96
Measures of factor score adequacy
PA1 PA2 PA3 PA5 PA4 PA6
Correlation of (regression) scores with factors 0.96 0.90 0.84 0.84 0.82 0.80
Multiple R square of scores with factors 0.92 0.81 0.71 0.71 0.67 0.63
Minimum correlation of possible factor scores 0.83 0.61 0.41 0.43 0.35 0.27
PA7 PA8
Correlation of (regression) scores with factors 0.80 0.78
Multiple R square of scores with factors 0.63 0.60
Minimum correlation of possible factor scores 0.27 0.21
print(efa_results$loadings)
Loadings:
PA1 PA2 PA3 PA5 PA4 PA6 PA7 PA8
0 0.501 0.111 0.344 -0.172
1 0.363 0.508 -0.124 0.107
2 0.323 0.296 0.148
3 -0.254 0.141 0.161 0.182 0.122
4 0.744 0.133 -0.105
5 0.161 -0.153 -0.110 -0.648
6 0.721 -0.143 0.151 -0.140
7 0.646
8 0.806
9 0.684 0.109 -0.107 -0.186
10 0.842 -0.108
11 -0.114 -0.203 -0.300
12 0.234 0.227 0.263 0.148 0.114 -0.373 0.255
13 -0.127 -0.114 0.106
14 0.415 0.125 0.178 -0.316 0.194
15 0.628 0.201 -0.151 0.165
16 0.182 -0.181 0.204
17 -0.128 -0.104 -0.535
18 -0.205 0.564
68 0.359 -0.213 -0.175 0.411 0.143
69 0.257 0.247 0.347
70 0.438 -0.227 0.305
71 -0.137 0.632
72 -0.370 0.104 -0.183 -0.116
73 0.132 -0.171 0.102
74 0.112 0.126 -0.141 0.332 0.283 -0.143 -0.325
75 -0.109 0.225 -0.158 0.254 0.288
76 0.473 0.103 0.111 0.135 -0.363
77 0.107 0.270 0.146
78 0.515 0.188 0.213 -0.107 0.116
79 0.116 0.147 -0.191 0.292 -0.148 0.322
80 0.222 -0.143 0.426 0.280 -0.120
81 0.720 0.107
82 0.395 0.109 -0.104 0.213
83 0.634 -0.166
PA1 PA2 PA3 PA5 PA4 PA6 PA7 PA8
SS loadings 4.477 1.786 1.375 1.204 1.179 0.953 0.965 0.857
Proportion Var 0.128 0.051 0.039 0.034 0.034 0.027 0.028 0.024
Cumulative Var 0.128 0.179 0.218 0.253 0.286 0.314 0.341 0.366
num items to num persons
- small simulation study
- estimation becomes unstable if items >> persons
# see: https://www.r-bloggers.com/2012/12/simple-data-simulator-for-the-2pl-model/
twopl.sim <- function( nitem = 20, npers = 100 ) {

  i.loc <- rnorm( n = nitem, mean = 0, sd = 1 ) # item locations (difficulties)
  p.loc <- rnorm( n = npers, mean = 0, sd = 1 ) # person locations (abilities)
  i.slp <- rlnorm( nitem, sdlog = .4 )          # item slopes (discriminations)

  temp <- matrix( rep( p.loc, length( i.loc ) ), ncol = length( i.loc ) )

  logits <- t( apply( temp, 1, '-', i.loc) )    # theta - b
  logits <- t( apply( logits, 1, '*', i.slp) )  # a * (theta - b)

  probabilities <- 1 / ( 1 + exp( -logits ) )

  resp.prob <- matrix( probabilities, ncol = nitem)

  obs.resp <- matrix( sapply( c(resp.prob), rbinom, n = 1, size = 1), ncol = length(i.loc) )

  output <- list()
  output$i.loc <- i.loc
  output$i.slp <- i.slp
  output$p.loc <- p.loc
  output$resp <- obs.resp
  output
}
i << n (fewer items than persons/LLMs):
set.seed(123)
data2pl <- twopl.sim(nitem = 300, npers = 3000)

# Track time to fit 2PL model
start_time <- Sys.time()

fit2PL <- mirt(data = as.data.frame(data2pl$resp),
               model = 1, # one-dimensional
               itemtype = "2PL", verbose = FALSE)
end_time <- Sys.time()
fit_time <- end_time - start_time

# Print the time taken to fit the model
cat("Time taken to fit the 2PL model in seconds:", fit_time, "\n")
Time taken to fit the 2PL model in seconds: 9.550896
item_parameters_2PL <- coef(fit2PL, IRTpars = TRUE, simplify = TRUE)$items
# item difficulty param
summary(item_parameters_2PL[,"b"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.21305 -0.61690 -0.04756 0.03039 0.64970 3.54400
# item discrimination param
summary(item_parameters_2PL[,"a"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2471 0.7974 1.0109 1.0992 1.3207 2.7445
layout(matrix(c(1,2),nrow=1))
plot(item_parameters_2PL[,"a"], data2pl$i.slp)
cor(item_parameters_2PL[,"a"], data2pl$i.slp)
[1] 0.991225
plot(item_parameters_2PL[,"b"], data2pl$i.loc)
cor(item_parameters_2PL[,"b"], data2pl$i.loc)
[1] 0.9976121
i >> n (more items than persons/LLMs):
- here we have the ratio 4:1
set.seed(555)
data2pl <- twopl.sim(nitem = 800, npers = 200)

# Track time to fit 2PL model
start_time <- Sys.time()

fit2PL <- mirt(data = as.data.frame(data2pl$resp),
               model = 1, # one-dimensional
               itemtype = "2PL", verbose = FALSE)
end_time <- Sys.time()
fit_time <- end_time - start_time

# Print the time taken to fit the model
cat("Time taken to fit the 2PL model in seconds:", fit_time, "\n")
Time taken to fit the 2PL model in seconds: 27.81824
item_parameters_2PL <- coef(fit2PL, IRTpars = TRUE, simplify = TRUE)$items
# Item difficulty param
summary(item_parameters_2PL[,"b"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.3721 -0.5704 0.1177 0.3485 0.8485 180.3183
# Item discrimination param
summary(item_parameters_2PL[,"a"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.002833 0.701107 0.965513 1.046822 1.285065 3.873584
# Plots
layout(matrix(c(1,2),nrow=1))
plot(item_parameters_2PL[,"a"], data2pl$i.slp)
cor(item_parameters_2PL[,"a"], data2pl$i.slp)
[1] 0.88547
plot(item_parameters_2PL[,"b"], data2pl$i.loc)
cor(item_parameters_2PL[,"b"], data2pl$i.loc)
[1] 0.1923016
boil it down to a prediction task
Central goal(s)
- identify a subset of items which strongly differentiate between LLMs
- identify a subset of items which do not differentiate between LLMs (vice versa); a first descriptive pass is sketched below
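As a simple descriptive first pass (a sketch re-using the data_long object prepared above, with its doc_id and acc columns): for binary accuracies, the item variance p(1 - p) is maximal at an accuracy of p = .5, so ranking items by their variance across LLMs already separates the two subsets.
# rank items by how strongly they differentiate between LLMs
item_stats <- data_long %>%
  group_by(doc_id) %>%
  summarize(p = mean(acc), v = var(acc)) %>%
  arrange(desc(v))
head(item_stats) # most differentiating items (highest variance across LLMs)
tail(item_stats) # least differentiating items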
simple Decision Trees
Idea motivated by Speck et al. (2017): the goal of their decision tree was to provide a classification to identify and distinguish between biology-derived and technology-derived developments. It aimed to rationalize the discussion about these developments by incorporating descriptive, normative, and emotional aspects, ultimately achieving an average accuracy of 90.0% in classifying the examples.
- Probably we could get better results using the Iterative Dichotomiser 3 (ID3) algorithm (a toy illustration of its splitting criterion follows after this list), which:
- calculates the entropy of every attribute using the data set S,
- splits the set S into subsets using the attribute for which entropy is minimum (or, equivalently, information gain is maximum),
- makes a decision tree node containing that attribute, and
- recurs on subsets using remaining attributes.
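A minimal sketch of the ID3 splitting criterion with toy vectors (my own illustration, not applied to the benchmark data): the entropy of a label vector and the information gain of a binary attribute.
# entropy of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}
# information gain of a binary attribute x for labels y
info_gain <- function(y, x) {
  entropy(y) - sum(sapply(split(y, x), function(s) length(s) / length(y) * entropy(s)))
}
y <- c("A", "A", "B", "B", "B", "A") # toy class labels
x <- c(1, 1, 0, 0, 0, 1)             # attribute that separates the classes perfectly
info_gain(y, x)                      # equals entropy(y), i.e., the gain is maximal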
# Create a CART model for classification
set.seed(123) # for reproducibility
num_cols <- 501

sub_dat <- data_wide[, 1:num_cols]
colnames(sub_dat)[2:ncol(sub_dat)] <- paste0("V", 1:(ncol(sub_dat)-1))
model <- rpart(model_id ~ ., data = sub_dat)
rpart.plot(model)
## cross-validation not working, error of predict function:
# train_indices <- sample(1:nrow(data_wide), 0.7 * nrow(data_wide))
# train_data <- data_wide[train_indices, 1:num_cols]
# colnames(train_data)[2:ncol(train_data)] <- paste0("V", 1:(ncol(train_data)-1))
# test_data <- data_wide[-train_indices, 1:num_cols]
# colnames(test_data)[2:ncol(test_data)] <- paste0("V", 1:(ncol(test_data)-1))
# predictions <- predict(model, newdata = test_data, type = "class")
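One possible (untested) explanation for the predict error: when the outcome is a character vector, rpart fits a regression ("anova") tree, and predict(..., type = "class") is only defined for classification trees; coercing model_id to a factor and setting method = "class" should help. Note also that data_wide contains one row per LLM, so every held-out LLM is a class unseen during training - cross-validating a classifier on model_id is conceptually problematic here anyway.
# possible fix (assumption, untested on the original data):
# train_data$model_id <- factor(train_data$model_id)
# model <- rpart(model_id ~ ., data = train_data, method = "class")
# predictions <- predict(model, newdata = test_data, type = "class")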
ridge regression, neural network?
- would address the p >> n problem; in our case, we have fewer observations (LLMs) than predictors
- adds a penalty term to the regression, which shrinks the coefficients and helps in high-dimensional settings; the penalty weight determines the strength of the regularization - higher values lead to more regularization (note that in glmnet this weight is called lambda, while alpha controls the mix of ridge and lasso penalties)
- necessary to cross-validate the result (could also bootstrap as a sensitivity check); a minimal sketch follows after this list
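A minimal ridge sketch with glmnet (my own illustration on simulated data; in a real analysis, X would be the LLM-by-item response matrix and y some LLM-level criterion, which is an assumption here):
pacman::p_load('glmnet')
set.seed(42)
X <- matrix(rbinom(20 * 500, 1, 0.5), nrow = 20) # 20 "LLMs", 500 items (p >> n)
y <- rnorm(20)                                   # toy criterion
cv_fit <- cv.glmnet(X, y, alpha = 0)             # alpha = 0 -> ridge; lambda tuned by CV
coef(cv_fit, s = "lambda.min")                   # coefficients at the CV-optimal lambda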
Literature:
- for regression models:
- in R there is a package called glmnet, which fits generalized linear and similar models via penalized maximum likelihood, see: https://glmnet.stanford.edu/articles/glmnet.html
- for Structural Equation Models:
- in R there is a package called regsem, which implements regularization for structural equation models (Jacobucci 2017); tutorial paper (Li, Jacobucci, and Ammerman 2021), guide to variable selection (Jacobucci, Brandmaier, and Kievit 2019)
- for Item Response Theory there are IRTrees, which are also applied to discover response styles in survey data
Random stuff
cite literature in Quarto (rmarkdown)
- Blah blah (see Yarkoni and Westfall 2017, 33–35; also Speck et al. 2017, ch. 1).
- Blah blah (Yarkoni and Westfall 2017, 33–35).
- Blah blah (Yarkoni and Westfall 2017; Speck et al. 2017).
- Rutkowski et al. says blah (2017).
- Yarkoni and Westfall (2017) says blah.
Our ideas
- Assigning items to content areas: map each of the items in the MMLU to its content area; there are over 50 different ones (astronomy, math, history, etc.)
- a possible approach to solving the estimation problems (simulation study) would be to obtain the logits / probabilities (after softmax) of the answers to the questions (A-E, …), in order to generate arbitrarily many answers for all LLMs:
- assume a probability vector for (A-C): [.11, .12, .31] \rightarrow form the sum: 0.11 + 0.12 + 0.31 = 0.54 \rightarrow divide each element by the sum: \frac{.11}{.54} = .2037 etc.; we then draw as many 0/1 responses as needed
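A minimal sketch of this renormalize-and-draw idea (my own illustration; it assumes option C is the correct answer):
p <- c(A = .11, B = .12, C = .31) # probability vector from above
p <- p / sum(p)                   # renormalize: .2037 .2222 .5741
answers <- sample(names(p), size = 1000, replace = TRUE, prob = p)
acc_sim <- as.integer(answers == "C") # simulated 0/1 accuracy draws
mean(acc_sim)                         # ~ .57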