Daniel Njoo
Prof. Leise
Stochastic Processes
Mapping of the human genome began in 1990, and finished 13 years later in 2003.
Mapping of the human brain began in 2013… will finish _____???
In the meantime, psychometrics, the process of gleaning insight into human behavior from observed test results, is the best tool we have
The usefulness of the equivalence between the Ising model and the 2PL Multidimensional Item Response Theory model
Introduce the Ising model
Introduce the 2PL Multidimensional Item Response Theory (MIRT) model
3 approaches in psychometric (and, by extension, educational) testing:
(1) Classical Test Theory: assumes after accounting for external conditions, every test taker has a true score
(2) Item Response Theory: assumes a latent variable model, AKA a common cause model, i.e. that test takers have a latent ability which can be measured regardless of a test's content— Spearman's \( G \) factor
(3) Network approach: assumes observed symptoms (e.g. lack of sleep) are not indicators of a latent disorder (e.g. depression), instead they are causally related
The problem with traditional IRT models, i.e. latent variable (LV) models, is that it is:
(1) difficult to incorporate uncertainty in item parameter estimates
(2) difficult to maintain simple parameter estimates as model complexity increases
So Gibbs sampling from the conditional distributions of the parameters would work great, especially in avoiding excess computation, as we'll see later. (NB: I didn't actually implement Gibbs sampling in my project.)
So if we reframe a traditional IRT model as a network, we get the benefits of the network approach.
Two popular model paradigms in psychometrics:
(1) Gaussian Graphical Model
(2) Ising Model
Both are Markov Random Fields (MRFs): undirected graphical models composed of random variables (RVs) that satisfy the Markov property.
Using common notation, a graph \( G=(V,E) \) consists of a set of vertices \( V=\{1,2,\dots,N\} \) and a set of edges \( E \); since the graph is undirected, \( (i,j)=(j,i) \) for \( i,j\in V \).
In the above example, \( G \) consists of three nodes, \( V=\{1,2,3\} \), and two edges, \( E=\{(1,2),(2,3)\} \).
MRFs give us two important corollaries:
(1) Any two non-adjacent nodes are conditionally independent given all other variables.
\[ {X_{u}\perp \!\!\!\perp X_{v}\mid X_{\setminus \{u,v\}}} \]
(2) Any node is conditionally independent of all other variables given its neighbors. \[ X_{v}\perp \!\!\!\perp X_{\setminus \{v\}}\mid X_{N(v)} \]
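For instance, in the three-node example above, nodes 1 and 3 are not adjacent, so corollary (1) gives \( X_1 \perp\!\!\!\perp X_3 \mid X_2 \).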
Pairwise Markov Random Fields (PMRFs) are MRFs with an extra specification: all interactions between variables are at most pairwise.
Therefore PMRFs can be parameterized as a product of potential functions:
\( P(X=x)=\frac{1}{Z}\prod_i\phi_i(x_i)\prod_{(ij)}\phi_{ij}(x_i,x_j) \)
where \( \prod_i \) takes the product over all nodes \( i=1,2,\dots,N \), \( \prod_{(ij)} \) takes the product over all distinct pairs of nodes \( i \) and \( j \) (with \( j>i \)), and \( Z \) is a normalizing constant often called the partition function.
The potential functions \( \phi(x) \) encode the potential or preference for a variable to be in a given state, e.g. \( \phi_i(x_i) \) encodes the node \( X_i \)'s preference to be in the observed state \( x_i \).
Similarly the pairwise potential functions encode the preferences of any two given nodes, \( X_i \) and \( X_j \) to be in respective states \( x_i \) and \( x_j \). As the computed result of a pairwise potential function grows, we'd therefore expect \( X_i \) to take \( x_i \) given that \( X_j \) takes \( x_j \), i.e.
\( \lim\limits_{\phi_{ij}(x_i,x_j)\rightarrow\infty} P(X_i=x_i|X_j=x_j)=1 \) and vice-versa.
A useful constraint is to have the product of each potential function over all states equal 1, which means that the natural logarithms sum to 0:
\( \prod_{x_i}\phi_i(x_i)=1\implies\sum_{x_i}\ln\phi_i(x_i)=0 \), and similarly for the pairwise potentials: \( \sum_{x_i}\ln\phi_{ij}(x_i,x_j)=\sum_{x_j}\ln\phi_{ij}(x_i,x_j)=0, \hspace{.5cm} \forall x_i,x_j \)
Using all of the above, we can find the distribution of any pair of nodes, say \( X_k \) and \( X_l \), conditioned on the rest of the network.
\( P(X_{\setminus \{k,l\}}=x_{\setminus \{k,l\}})=\sum_{x_k,x_l}P(X=x) \)
\( =\frac{1}{Z}\prod_{i\notin \{k,l\}}\phi_i(x_i)\prod_{(ij):\,i,j\notin \{k,l\}}\phi_{ij}(x_i,x_j) \sum_{x_k,x_l}\left(\phi_k(x_k)\phi_l(x_l)\phi_{kl}(x_k,x_l)\prod_{i\notin \{k,l\}}\phi_{ik}(x_i,x_k)\phi_{il}(x_i,x_l)\right) \)
\( P(X_k=x_k,X_l=x_l\mid X_{\setminus \{k,l\}}=x_{\setminus \{k,l\}})=\frac{P(X=x)}{P(X_{\setminus \{k,l\}}=x_{\setminus \{k,l\}})} \)
\( =\frac{\phi_k^*(x_k)\phi_l^*(x_l)\phi_{kl}(x_k,x_l)}{\sum_{x_k,x_l}\phi_k^*(x_k)\phi_l^*(x_l)\phi_{kl}(x_k,x_l)} \)
where \( \phi_k^*(x_k)=\phi_k(x_k)\prod_{i\notin \{k,l\}}\phi_{ik}(x_i,x_k) \) and \( \phi_l^*(x_l)=\phi_l(x_l)\prod_{i\notin \{k,l\}}\phi_{il}(x_i,x_l) \).
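This conditional can be checked by brute force for a small network: hold the remaining nodes fixed and normalize the joint potential over the free pair. Below is a minimal R sketch for a hypothetical three-node binary PMRF (the potential values are made up purely for illustration), computing \( P(X_1,X_2\mid X_3=x_3) \):
# Minimal sketch: conditional distribution of two nodes given the rest,
# for a hypothetical 3-node binary PMRF (nodes take values -1 or 1).
# Potential functions are encoded as lookup tables indexed by state (-1 -> 1, 1 -> 2).
idx <- function(x) ifelse(x == -1, 1, 2)
phi_node <- list(c(1.2, 0.8), c(1.0, 1.0), c(0.9, 1.1))   # phi_i(x_i), hypothetical values
phi_12   <- matrix(c(1.5, 0.7, 0.7, 1.5), 2, 2)           # phi_12(x_1, x_2)
phi_23   <- matrix(c(1.5, 0.7, 0.7, 1.5), 2, 2)           # phi_23(x_2, x_3)
# P(X1 = x1, X2 = x2 | X3 = x3): keep x3 fixed and normalize over (x1, x2)
cond_12_given_3 <- function(x3) {
  states <- expand.grid(X1 = c(-1, 1), X2 = c(-1, 1))
  pot <- apply(states, 1, function(x)
    phi_node[[1]][idx(x[1])] * phi_node[[2]][idx(x[2])] * phi_node[[3]][idx(x3)] *
    phi_12[idx(x[1]), idx(x[2])] * phi_23[idx(x[2]), idx(x3)])
  cbind(states, Probability = pot / sum(pot))
}
cond_12_given_3(x3 = 1)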
If we apply another constraint to our PMRF, letting variables take only the values 1 or -1, we obtain a binary PMRF, also known as the Ising model, which has a long history in physics and is used today in computer vision.
We now have a convenient loglinear model where \( \tau \) and \( \omega \) are known as the threshold and network parameters respectively:
And our joint distribution is now: \( P(X=x)=\frac{1}{Z}\exp\left( \sum_i\tau_ix_i +\sum_{(i,j)}\omega_{ij}x_ix_j \right) \)
Let \( \tau_1=\tau_2=\tau_3=-0.1 \), and \( \omega_{12}=\omega_{23}=0.5 \), then:
library(dplyr)  # for %>%, arrange(), desc()

taus <- rep(-0.1, 3)     # threshold parameters
omegas <- rep(0.5, 2)    # network parameters for edges (1,2) and (2,3)

# All 2^n combinations of -1/1 across n nodes
combinations <- function(n) expand.grid(replicate(n, c(-1, 1), simplify = FALSE))
comb <- combinations(3)  # 8 combinations

calc_example <- function(taus, omegas) {
  pot <- rep(0, nrow(comb))
  for (i in 1:nrow(comb)) {
    pot[i] <- exp(sum(taus * comb[i, ]) +
                  omegas[1] * comb[i, 1] * comb[i, 2] +
                  omegas[2] * comb[i, 2] * comb[i, 3])
  }
  prob <- pot / sum(pot)  # normalize by the partition function Z
  res <- comb %>% cbind(pot, prob) %>% arrange(desc(prob))
  colnames(res) <- c("X1", "X2", "X3", "Potential", "Probability")
  res
}
calc_example(taus,omegas)
X1 X2 X3 Potential Probability
1 -1 -1 -1 3.6692967 0.35138083
2 1 1 1 2.0137527 0.19284189
3 1 -1 -1 1.1051709 0.10583387
4 -1 -1 1 1.1051709 0.10583387
5 1 1 -1 0.9048374 0.08664945
6 -1 1 1 0.9048374 0.08664945
7 -1 1 -1 0.4065697 0.03893411
8 1 -1 1 0.3328711 0.03187655
Given our threshold parameters, it's not surprising that the most probable state for the model is where all the nodes are equal to \( -1 \). We also see that the second most probable state is where all the nodes are \( 1 \), due to the strong network parameters.
If we change the network parameters to \( 0 \):
omegas <- rep(0,2)
calc_example(taus,omegas)
X1 X2 X3 Potential Probability
1 -1 -1 -1 1.3498588 0.16622440
2 1 -1 -1 1.1051709 0.13609303
3 -1 1 -1 1.1051709 0.13609303
4 -1 -1 1 1.1051709 0.13609303
5 1 1 -1 0.9048374 0.11142355
6 1 -1 1 0.9048374 0.11142355
7 -1 1 1 0.9048374 0.11142355
8 1 1 1 0.7408182 0.09122588
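With the network parameters set to 0, the joint distribution factorizes, so the nodes are independent and each node's marginal is \( P(X_i=x_i)=\exp(\tau_ix_i)/(\exp(\tau_i)+\exp(-\tau_i)) \). A quick check of this against the table above:
# With omega = 0 the joint probability is a product of per-node marginals
p_node <- function(tau, x) exp(tau * x) / (exp(tau) + exp(-tau))
prod(p_node(-0.1, c(-1, -1, -1)))  # 0.1662244, matches the first row above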
1-parameter logistic (1PL) IRT uses a logistic item response function, and therefore:
\( P(X_i=x_i|\Theta=\theta)_{1PL}=\frac{\exp(x_i\alpha(\theta-\delta_i))}{\sum_{x_i}\exp(x_i\alpha(\theta-\delta_i))} \)
where \( \delta_i \) is a difficulty parameter and \( \alpha \) is a common discrimination parameter for all items.
In 2PL IRT, the discrimination parameter \( \alpha_i \) is allowed to vary across items:
\( P(X_i=x_i|\Theta=\theta)_{2PL}=\frac{\exp(x_i\alpha_i(\theta-\delta_i))}{\sum_{x_i}\exp(x_i\alpha_i(\theta-\delta_i))} \)
And if we generalize this to more than one latent ability, i.e. \( M>1 \), we get the 2PL Multidimensional IRT (MIRT) model:
\( P(X_i=x_i|\Theta=\theta)_{MIRT}=\frac{\exp(x_i(\alpha_i^T\theta-\delta_i))}{\sum_{x_i}\exp(x_i(\alpha_i^T\theta-\delta_i))} \)
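As a concrete illustration, here is a minimal R sketch of this item response probability for a single item, assuming the same -1/1 response coding as the Ising model above; the ability vector and item parameters are hypothetical, chosen only for the example:
# P(X_i = x_i | theta) under the 2PL MIRT model, with x_i coded as -1 or 1
p_mirt <- function(x_i, theta, alpha_i, delta_i) {
  lin <- sum(alpha_i * theta) - delta_i          # alpha_i' theta - delta_i
  exp(x_i * lin) / (exp(lin) + exp(-lin))        # normalize over x_i in {-1, 1}
}

# Hypothetical item with M = 2 latent abilities
p_mirt(x_i = 1, theta = c(0.5, -0.2), alpha_i = c(1.2, 0.8), delta_i = 0.3)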
The joint conditional probability of \( X \) can be written as:
\( P(X=x|\Theta=\theta)=\prod_iP(X_i=x_i|\Theta=\theta) \)
and the marginal probability can then be obtained by integrating out \( \Theta \):
\( P(X=x)=\int\limits_{-\infty}^\infty f(\theta)P(X=x|\Theta=\theta)\,d\theta \)
As \( M \) grows, this integral becomes harder to compute; however, if the distribution of \( \Theta \) is chosen such that its posterior distribution given \( X=x \) takes a Gaussian form, it is possible to obtain a closed form solution to the integral. This closed form solution is the Ising model presented earlier.
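For \( M=1 \) the integral is still easy to evaluate numerically; here is a minimal sketch, assuming a standard-normal \( f(\theta) \) and hypothetical item parameters (for larger \( M \) this becomes a multidimensional integral and quickly gets expensive):
# Marginal probability P(X = x) for M = 1 by numerical integration over theta,
# assuming f(theta) is standard normal. Item parameters are hypothetical.
alphas <- c(1.0, 1.5, 0.8)   # discrimination parameters
deltas <- c(-0.2, 0.3, 0.0)  # difficulty parameters
x      <- c(1, -1, 1)        # an observed response pattern (coded -1 / 1)

p_item <- function(x_i, theta, a, d) {
  exp(x_i * (a * theta - d)) / (exp(a * theta - d) + exp(-(a * theta - d)))
}

integrand <- function(theta) {
  dnorm(theta) * sapply(theta, function(t) prod(p_item(x, t, alphas, deltas)))
}

integrate(integrand, lower = -Inf, upper = Inf)$value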
The Ising model is equivalent to a 2PL MIRT model with a posterior Gaussian distribution on the latent traits.
This equivalence is promising for researchers in both education and psychology who use the 2PL MIRT model, as they can now benefit from existing work on the Ising model in mathematics and statistics.
Finally, the Ising model lends itself well to MCMC methods in which one samples each node conditional on all the others, because the alternative, computing \( Z \), the partition function, requires a sum over all \( 2^n \) possible states of a network with \( n \) variables, which makes it intractable except for small networks (\( n<10 \)).
I originally planned to explore Gibbs Sampling in Ising models, but didn't manage to in this project.
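For reference, here is a minimal sketch (not implemented as part of this project) of what a single-site Gibbs sampler for the Ising model above could look like; the full conditional of one node given the rest reduces to a logistic form, so \( Z \) never has to be computed:
# Minimal sketch of a single-site Gibbs sampler for the Ising model, x_i in {-1, 1}.
# Parameters mirror the three-node example above; purely illustrative.
gibbs_ising <- function(taus, omega, n_iter = 5000) {
  n <- length(taus)
  x <- sample(c(-1, 1), n, replace = TRUE)   # random starting state
  samples <- matrix(NA, n_iter, n)
  for (it in 1:n_iter) {
    for (i in 1:n) {
      # Full conditional of node i given the rest: logistic in tau_i + sum_j omega_ij x_j
      field <- taus[i] + sum(omega[i, -i] * x[-i])
      p_plus <- exp(field) / (exp(field) + exp(-field))
      x[i] <- ifelse(runif(1) < p_plus, 1, -1)
    }
    samples[it, ] <- x
  }
  samples
}

omega <- matrix(0, 3, 3); omega[1, 2] <- omega[2, 1] <- 0.5; omega[2, 3] <- omega[3, 2] <- 0.5
s <- gibbs_ising(taus = c(-0.1, -0.1, -0.1), omega = omega)
colMeans(s == -1)  # empirical P(X_i = -1), comparable to the exact table above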