Daniel Njoo
Prof. Leise
Stochastic Processes
Mapping of the human genome began in 1990, and finished 13 years later in 2003.
Mapping of the human brain began in 2013… will finish _____???
In the meantime, psychometrics, the process of gleaning insight into human behavior from observed test results, is the best tool we have
The usefulness of the equivalence between the Ising model and the 2PL Multidimensional Item Response Theory model
Introduce the Ising model
Introduce the 2PL Multidimensional Item Response Theory (MIRT) model
3 approaches in psychometric (and, by extension, educational) testing:
(1) Classical Test Theory: assumes after accounting for external conditions, every test taker has a true score
(2) Item Response Theory: assumes a latent variable model, AKA a common cause model, i.e. that test takers have a latent ability which can be measured regardless of a test's content— Spearman's \( G \) factor
(3) Network approach: assumes observed symptoms (e.g. lack of sleep) are not indicators of a latent disorder (e.g. depression), instead they are causally related
The problem with traditional IRT models, i.e. latent variable (LV) models, is that it is:
(1) difficult to incorporate uncertainty in item parameter estimates
(2) difficult to maintain simple parameter estimates as model complexity increases
So Gibbs sampling from the conditional distributions of the parameters would work great, especially in avoiding excess computation, as we'll see later. (NB: I didn't actually implement Gibbs sampling in my project.)
So if we reframe a traditional IRT model as a network, we get the benefits of the network approach.
Two popular model paradigms in psychometrics:
(1) Gaussian Graphical Model
(2) Ising Model
Both are Markov Random Fields (MRFs): undirected graphical models composed of random variables (RVs) that satisfy the Markov property.
Using common notation, a graph \( G=(V,E) \) consists of a set of vertices \( V=\{1,2,\dots,N\} \) and a set of edges \( E \); since the graph is undirected, \( (i,j)=(j,i) \) for \( i,j\in V \).
In the above example, \( G \) consists of three nodes, \( V=\{1,2,3\} \), and two edges, \( E=\{(1,2),(2,3)\} \).
MRFs give us two important corollaries:
(1) Any two non-adjacent nodes are conditionally independent given all other variables.
\[ {X_{u}\perp \!\!\!\perp X_{v}\mid X_{\setminus \{u,v\}}} \]
(2) Any node is conditionally independent of all other variables given its neighbors. \[ X_{v}\perp \!\!\!\perp X_{\setminus \{v\}}\mid X_{N(v)} \]
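For instance, in the three-node example above, nodes 1 and 3 are not adjacent, so corollary (1) gives \( X_1 \perp\!\!\!\perp X_3 \mid X_2 \).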
Pairwise Markov Random Fields (PMRFs) are MRFs with an extra specification: all interactions between variables are at most pairwise.
Therefore PMRFs can be parameterized as a product of potential functions:
\( P(X=x)=\frac{1}{Z}\prod_i\phi_i(x_i)\prod_{(ij)}\phi_{ij}(x_i,x_j) \)
where \( \prod_i \) takes the product over all nodes \( i=1,2,\dots,N \), \( \prod_{(ij)} \) takes the product over all distinct pairs of nodes \( i \) and \( j \) (with \( j>i \)), and \( Z \) is a normalizing constant often called the partition function.
The potential functions \( \phi(x) \) encode the potential or preference for a variable to be in a given state, e.g. \( \phi_i(x_i) \) encodes the node \( X_i \)'s preference to be in the observed state \( x_i \).
Similarly the pairwise potential functions encode the preferences of any two given nodes, \( X_i \) and \( X_j \) to be in respective states \( x_i \) and \( x_j \). As the computed result of a pairwise potential function grows, we'd therefore expect \( X_i \) to take \( x_i \) given that \( X_j \) takes \( x_j \), i.e.
\( \lim\limits_{\phi_{ij}(x_i,x_j)\rightarrow\infty} P(X_i=x_i|X_j=x_j)=1 \) and vice-versa.
A useful constraint is to have the product of each potential function over all states equal 1, which means that the natural logarithms sum to 0:
\( \prod_{x_i}\phi_i(x_i)=1\implies\sum_{x_i}\ln\phi_i(x_i)=0 \), and similarly for the pairwise potentials: \( \sum_{x_i}\ln\phi_{ij}(x_i,x_j)=\sum_{x_j}\ln\phi_{ij}(x_i,x_j)=0, \hspace{.5cm} \forall x_i,x_j \)
Using all of the above, we can find the distribution of any pair of nodes, say \( X_k \) and \( X_l \), conditioned on the rest of the network.
\( P(X_{\setminus \{k,l\}}=x_{\setminus \{k,l\}})=\sum_{x_k,x_l}P(X=x) \)
\( =\frac{1}{Z}\prod_{i\notin \{k,l\}}\phi_i(x_i)\prod_{(ij):\,i,j\notin \{k,l\}}\phi_{ij}(x_i,x_j) \sum_{x_k,x_l}\left(\phi_k(x_k)\phi_l(x_l)\phi_{kl}(x_k,x_l)\prod_{i\notin \{k,l\}}\phi_{ik}(x_i,x_k)\phi_{il}(x_i,x_l)\right) \)
\( P(X_k=x_k,X_l=x_l\mid X_{\setminus \{k,l\}}=x_{\setminus \{k,l\}})=\frac{P(X=x)}{P(X_{\setminus \{k,l\}}=x_{\setminus \{k,l\}})} \)
\( =\frac{\phi_k^*(x_k)\phi_l^*(x_l)\phi_{kl}(x_k,x_l)}{\sum_{x_k,x_l}\phi_k^*(x_k)\phi_l^*(x_l)\phi_{kl}(x_k,x_l)} \)
where \( \phi_k^*(x_k)=\phi_k(x_k)\prod_{i\notin \{k,l\}}\phi_{ik}(x_i,x_k) \) and \( \phi_l^*(x_l)=\phi_l(x_l)\prod_{i\notin \{k,l\}}\phi_{il}(x_i,x_l) \).
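This conditional can be checked by brute force for a small network: hold the remaining nodes fixed and normalize the joint potential over the free pair. Below is a minimal R sketch for a hypothetical three-node binary PMRF (the potential values are made up purely for illustration), computing \( P(X_1,X_2\mid X_3=x_3) \):
# Minimal sketch: conditional distribution of two nodes given the rest,
# for a hypothetical 3-node binary PMRF (nodes take values -1 or 1).
# Potential functions are encoded as lookup tables indexed by state (-1 -> 1, 1 -> 2).
idx <- function(x) ifelse(x == -1, 1, 2)
phi_node <- list(c(1.2, 0.8), c(1.0, 1.0), c(0.9, 1.1))   # phi_i(x_i), hypothetical values
phi_12   <- matrix(c(1.5, 0.7, 0.7, 1.5), 2, 2)           # phi_12(x_1, x_2)
phi_23   <- matrix(c(1.5, 0.7, 0.7, 1.5), 2, 2)           # phi_23(x_2, x_3)
# P(X1 = x1, X2 = x2 | X3 = x3): keep x3 fixed and normalize over (x1, x2)
cond_12_given_3 <- function(x3) {
  states <- expand.grid(X1 = c(-1, 1), X2 = c(-1, 1))
  pot <- apply(states, 1, function(x)
    phi_node[[1]][idx(x[1])] * phi_node[[2]][idx(x[2])] * phi_node[[3]][idx(x3)] *
    phi_12[idx(x[1]), idx(x[2])] * phi_23[idx(x[2]), idx(x3)])
  cbind(states, Probability = pot / sum(pot))
}
cond_12_given_3(x3 = 1)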
If we apply another constraint to our PMRF, letting variables take only the values 1 or -1, we obtain a binary PMRF, also known as the Ising model, which has a long history in physics and is used today in computer vision.
We now have a convenient loglinear model where \( \tau \) and \( \omega \) are known as the threshold and network parameters respectively:
And our joint distribution is now: \( P(X=x)=\frac{1}{Z}\exp\left( \sum_i\tau_ix_i +\sum_{(i,j)}\omega_{ij}x_ix_j \right) \)
Let \( \tau_1=\tau_2=\tau_3=-0.1 \), and \( \omega_{12}=\omega_{23}=0.5 \), then:
library(dplyr)  # for %>%, arrange(), desc()

taus <- rep(-0.1, 3)     # threshold parameters
omegas <- rep(0.5, 2)    # network parameters for edges (1,2) and (2,3)

# All 2^n combinations of -1/1 across n nodes
combinations <- function(n) expand.grid(replicate(n, c(-1, 1), simplify = FALSE))
comb <- combinations(3)  # 8 combinations

calc_example <- function(taus, omegas) {
  pot <- rep(0, nrow(comb))
  for (i in 1:nrow(comb)) {
    pot[i] <- exp(sum(taus * comb[i, ]) +
                  omegas[1] * comb[i, 1] * comb[i, 2] +
                  omegas[2] * comb[i, 2] * comb[i, 3])
  }
  prob <- pot / sum(pot)  # normalize by the partition function Z
  res <- comb %>% cbind(pot, prob) %>% arrange(desc(prob))
  colnames(res) <- c("X1", "X2", "X3", "Potential", "Probability")
  res
}
calc_example(taus,omegas)
X1 X2 X3 Potential Probability
1 -1 -1 -1 3.6692967 0.35138083
2 1 1 1 2.0137527 0.19284189
3 1 -1 -1 1.1051709 0.10583387
4 -1 -1 1 1.1051709 0.10583387
5 1 1 -1 0.9048374 0.08664945
6 -1 1 1 0.9048374 0.08664945
7 -1 1 -1 0.4065697 0.03893411
8 1 -1 1 0.3328711 0.03187655
Given our threshold parameters, it's not surprising that the most probable state for the model is where all the nodes are equal to \( -1 \). We also see that the second most probable state is where all the nodes are \( 1 \), due to the strong network parameters.
If we change the network parameters to \( 0 \):
omegas <- rep(0,2)
calc_example(taus,omegas)
X1 X2 X3 Potential Probability
1 -1 -1 -1 1.3498588 0.16622440
2 1 -1 -1 1.1051709 0.13609303
3 -1 1 -1 1.1051709 0.13609303
4 -1 -1 1 1.1051709 0.13609303
5 1 1 -1 0.9048374 0.11142355
6 1 -1 1 0.9048374 0.11142355
7 -1 1 1 0.9048374 0.11142355
8 1 1 1 0.7408182 0.09122588
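With the network parameters set to 0, the joint distribution factorizes, so the nodes are independent and each node's marginal is \( P(X_i=x_i)=\exp(\tau_ix_i)/(\exp(\tau_i)+\exp(-\tau_i)) \). A quick check of this against the table above:
# With omega = 0 the joint probability is a product of per-node marginals
p_node <- function(tau, x) exp(tau * x) / (exp(tau) + exp(-tau))
prod(p_node(-0.1, c(-1, -1, -1)))  # 0.1662244, matches the first row above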
1-parameter logistic (1PL) IRT uses a logistic item response function, and therefore:
\( P(X_i=x_i|\Theta=\theta)_{1PL}=\frac{\exp(x_i\alpha(\theta-\delta_i))}{\sum_{x_i}\exp(x_i\alpha(\theta-\delta_i))} \)
where \( \delta_i \) is a difficulty parameter and \( \alpha \) is a common discrimination parameter for all items.
In 2PL IRT, the discrimination parameter \( \alpha_i \) is allowed to vary across items:
\( P(X_i=x_i|\Theta=\theta)_{2PL}=\frac{\exp(x_i\alpha_i(\theta-\delta_i))}{\sum_{x_i}\exp(x_i\alpha_i(\theta-\delta_i))} \)
And if we generalize this to more than one latent ability, i.e. \( M>1 \), we get the 2PL Multidimensional IRT (MIRT) model:
\( P(X_i=x_i|\Theta=\theta)_{MIRT}=\frac{\exp(x_i(\alpha_i^T\theta-\delta_i))}{\sum_{x_i}\exp(x_i(\alpha_i^T\theta-\delta_i))} \)
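As a concrete illustration, here is a minimal R sketch of this item response probability for a single item, assuming the same -1/1 response coding as the Ising model above; the ability vector and item parameters are hypothetical, chosen only for the example:
# P(X_i = x_i | theta) under the 2PL MIRT model, with x_i coded as -1 or 1
p_mirt <- function(x_i, theta, alpha_i, delta_i) {
  lin <- sum(alpha_i * theta) - delta_i          # alpha_i' theta - delta_i
  exp(x_i * lin) / (exp(lin) + exp(-lin))        # normalize over x_i in {-1, 1}
}

# Hypothetical item with M = 2 latent abilities
p_mirt(x_i = 1, theta = c(0.5, -0.2), alpha_i = c(1.2, 0.8), delta_i = 0.3)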
The joint conditional probability of \( X \) can be written as:
\( P(X=x|\Theta=\theta)=\prod_iP(X_i=x_i|\Theta=\theta) \)
and the marginal probability can then be obtained by integrating out \( \Theta \):
\( P(X=x)=\int\limits_{-\infty}^\infty f(\theta)P(X=x|\Theta=\theta)\,d\theta \)
As \( M \) grows, this integral becomes harder to compute; however, if the distribution of \( \Theta \) is chosen such that its posterior distribution given \( X=x \) takes a Gaussian form, it is possible to obtain a closed form solution to the integral. This closed form solution is the Ising model presented earlier.
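For \( M=1 \) the integral is still easy to evaluate numerically; here is a minimal sketch, assuming a standard-normal \( f(\theta) \) and hypothetical item parameters (for larger \( M \) this becomes a multidimensional integral and quickly gets expensive):
# Marginal probability P(X = x) for M = 1 by numerical integration over theta,
# assuming f(theta) is standard normal. Item parameters are hypothetical.
alphas <- c(1.0, 1.5, 0.8)   # discrimination parameters
deltas <- c(-0.2, 0.3, 0.0)  # difficulty parameters
x      <- c(1, -1, 1)        # an observed response pattern (coded -1 / 1)

p_item <- function(x_i, theta, a, d) {
  exp(x_i * (a * theta - d)) / (exp(a * theta - d) + exp(-(a * theta - d)))
}

integrand <- function(theta) {
  dnorm(theta) * sapply(theta, function(t) prod(p_item(x, t, alphas, deltas)))
}

integrate(integrand, lower = -Inf, upper = Inf)$value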
The Ising model is equivalent to a 2PL MIRT model with a posterior Gaussian distribution on the latent traits.
This equivalence is promising for researchers in both education and psychology who use the 2PL MIRT model, as they can now benefit from existing work on the Ising model in mathematics and statistics.
Finally, the Ising model lends itself well to MCMC methods in which one samples each node conditional on all the others, because the alternative, computing \( Z \), the partition function, requires a sum over all \( 2^n \) possible states of a network with \( n \) variables, which makes it intractable except for small networks (\( n<10 \)).
I originally planned to explore Gibbs Sampling in Ising models, but didn't manage to in this project.
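For reference, here is a minimal sketch (not implemented as part of this project) of what a single-site Gibbs sampler for the Ising model above could look like; the full conditional of one node given the rest reduces to a logistic form, so \( Z \) never has to be computed:
# Minimal sketch of a single-site Gibbs sampler for the Ising model, x_i in {-1, 1}.
# Parameters mirror the three-node example above; purely illustrative.
gibbs_ising <- function(taus, omega, n_iter = 5000) {
  n <- length(taus)
  x <- sample(c(-1, 1), n, replace = TRUE)   # random starting state
  samples <- matrix(NA, n_iter, n)
  for (it in 1:n_iter) {
    for (i in 1:n) {
      # Full conditional of node i given the rest: logistic in tau_i + sum_j omega_ij x_j
      field <- taus[i] + sum(omega[i, -i] * x[-i])
      p_plus <- exp(field) / (exp(field) + exp(-field))
      x[i] <- ifelse(runif(1) < p_plus, 1, -1)
    }
    samples[it, ] <- x
  }
  samples
}

omega <- matrix(0, 3, 3); omega[1, 2] <- omega[2, 1] <- 0.5; omega[2, 3] <- omega[3, 2] <- 0.5
s <- gibbs_ising(taus = c(-0.1, -0.1, -0.1), omega = omega)
colMeans(s == -1)  # empirical P(X_i = -1), comparable to the exact table above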