Q-matrix and variational autoencoders to estimate multidimensional item response theory models with correlated and independent latent variables
ML Reading Group - AGenCy Lab (Dhaka, Bangladesh)
Mahbubul Hasan
October 23, 2021
Plan for this lecture
- Data description
- Generative model, why generative model?
- Autoencoder, Variational Autoencoder (VAE)
- Q-matrix & how to use it in a VAE model?
- Multidimensional Logistic 2-Parameter (ML2P) Model
- VAE demo in R
- VAE/deep learning frontiers in educational assessment
- Future directions or extensions
Which face is fake?
[Image: MIT 6.S191]
Data
- ML2Pvae (Variational Autoencoder Models for IRT Parameter Estimation) R package (simulated data)
- data(response)
- 30-item exam which assesses 3 latent traits.
- Latent abilities for N = 5000 students, found in the data frame theta_true, were sampled from \(N(0, \Sigma)\).
- \(\Sigma\) specifies the correlations between the 3 abilities and is found in the data frame correlation_matrix.
- Discrimination parameters were sampled uniformly from \([0.25, 1.75]\), difficulty parameters from \([-3, 3]\), and Q-matrix entries from \(Bern(0.35)\).
- Probabilities for each student answering each question correctly were calculated with the ML2P model (McKinley and Reckase, 1980).
DataFrame (head)
- data(response)

- head(q_matrix)
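As a quick look, a minimal sketch in R (response and q_matrix are the datasets bundled with the ML2Pvae package):

library(ML2Pvae)

data(response)   # binary responses: 5000 students x 30 items
data(q_matrix)   # Q-matrix: 3 latent skills x 30 items

head(response)
head(q_matrix)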

Unsupervised learning
- Data: only the raw observations \(X\); no labels \(Y\) exist.
- Objective: explore and learn the hidden, underlying structure of the data
- Examples: dimensionality reduction, clustering
Generative model
- Generative means that once we learn the underlying structure of the data, we can create new examples that do not exist in the training set
- Goal: take as input training samples from some distribution and learn a model that represents that distribution
- The generative modeling process
[O'Reilly, 2021]
Why generative models?
- Facial classification, debiasing
- Able to uncover underlying latent variables in the dataset

- Outlier detection
- Detect outliers in the distribution and use them during training to improve the model even more!

[Amini,2019]
What is latent variable?
- Plato's myth of the cave:
- The shadows are the observed variables; the latent variables are the underlying objects generating the shadows
- The latent variables in our case (response data) are the knowledge components needed to solve the problems (i.e., the questions)
What is autoencoder?
- Autoencoders are a special class of neural networks in which the output is trained to reconstruct the input.
- An AE consists of two neural networks: an encoder and a decoder. The encoder takes the high-dimensional input \(x\) and feeds it forward through one or more hidden layers to some latent space.
- The decoder takes this latent representation and feeds it forward through hidden layers to output a reconstruction of the original input, \(\hat{x}\).
- AE model

[Rocca,2019]
Why do we care about a low-dimensional z?
- A lower-dimensional feature representation learned from unlabeled data
- Reduces noise; keeps the most meaningful features of the distribution
- Train the model to use these features to reconstruct the original data
- The bottleneck hidden layer
What is VAE?
- A VAE is a probabilistic twist on the autoencoder.
- VAE still consists of an encoder and decoder, but the latent space returned from the encoder is trained to learn a probability distribution for \(\Theta\) given \(X\).
- VAE allows the network to map the training data to a (latent) normal distribution.
- Sample from this distribution and feed that sample forward through the decoder
- VAE model (MNIST data)

VAE optimization
- Training VAE model
[Doersch, 2016]
- Left is without the “reparameterization trick”, and right is with it.
- Red shows sampling operations that are non-differentiable. Blue shows loss layers.
- The feed forward behavior of these networks is identical, but backpropagation can be applied only to the right network.
Reparametrizing the sampling layer
- Reparametrizing
[MIT 6.S191]
- Enables training the VAE end to end by backpropagation with respect to z and the actual weights of the encoder (see the sketch below)
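For intuition, a minimal sketch of this sampling layer in R with keras, mirroring the layer names and sizes in the encoder summary printed later in these slides (the sigmoid activations are illustrative; this is not the ML2Pvae package's internal code):

library(keras)

latent_dim <- 3  # number of latent traits in our setting

# A small encoder: 30 inputs -> 16 -> 8 -> (z_mean, z_log_var)
input <- layer_input(shape = 30, name = "input")
h <- input %>%
  layer_dense(units = 16, activation = "sigmoid", name = "hidden_1") %>%
  layer_dense(units = 8, activation = "sigmoid", name = "hidden_2")
z_mean <- h %>% layer_dense(units = latent_dim, name = "z_mean")
z_log_var <- h %>% layer_dense(units = latent_dim, name = "z_log_var")

# Reparameterization: z = mu + exp(z_log_var / 2) * epsilon, epsilon ~ N(0, I).
# All randomness is isolated in epsilon, so gradients can flow through
# z_mean and z_log_var back into the encoder weights.
sampling <- function(arg) {
  z_mean <- arg[, 1:latent_dim]
  z_log_var <- arg[, (latent_dim + 1):(2 * latent_dim)]
  epsilon <- k_random_normal(shape = c(k_shape(z_mean)[[1]]), mean = 0, stddev = 1)
  z_mean + k_exp(z_log_var / 2) * epsilon
}

z <- layer_concatenate(list(z_mean, z_log_var), name = "z") %>%
  layer_lambda(sampling, name = "lambda")

encoder <- keras_model(input, z)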
VAE: latent perturbation during training
- Perturbation: a small change in a system, e.g., as a result of a third object interacting with the system
- Slowly increase or decrease one latent variable while keeping the other variables fixed
- Disentanglement

VAE summary
- Compress a representation of the information into something we can use to learn
- Reconstruction allows for unsupervised learning (no labels Y; the input itself is the target)
- Reparametrization trick to train end to end
- Use latent perturbation to interpret the hidden layers
- Generate new data/examples
What is a Q-matrix and what can we do with it?
- Shows the relationship between test items and latent or underlying attributes, or concepts.
- The Q-matrix is an M×N matrix, where M equals the number of questions in an assessment and N equals the total number of concepts required for understanding all questions.

- The Q-matrix can be used for understanding students' performance
- By assigning the closest ideal response to a student's response vector, one can infer which concepts the student knows and which they do not (see the toy example below)
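For instance, a toy Q-matrix for a 4-question assessment over 3 concepts might look like this (entries are illustrative; 1 means the question requires the concept):

# Rows = questions (M = 4), columns = concepts (N = 3)
Q_toy <- matrix(c(1, 0, 0,
                  1, 1, 0,
                  0, 1, 0,
                  0, 0, 1),
                nrow = 4, byrow = TRUE)
rownames(Q_toy) <- paste0("item_", 1:4)
colnames(Q_toy) <- paste0("concept_", 1:3)
Q_toy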
Q-matrix & how to use it in a VAE model?
- AE and VAE architectures are combined with the ML2P model in the following ways:
- No hidden layers in the decoder; a sigmoidal activation function on the output layer nodes (with non-negative weights).
- The Q-matrix determines the connections between the latent traits and the output items.
- VAE model
[Converse et al.,2021]
What are the difficulty and discrimination parameters?
- Reflect students' varied levels of understanding
- difficulty

- Difficulty parameters can be any real number, but usually lie in \([-3, 3]\)
Discrimination parameters
- Be fair: students with better understanding should do better on the test
- discrimination

- Discrimination parameters are always positive, usually in \([0.25, 3]\)
Multidimensional item response theory (MIRT)
- A naive approach to quantifying student knowledge is to look at the percentage of questions that the student answered correctly
- This does not take into account the fact that each item on the assessment is different, both in difficulty and in content
For example, if one student answers only questions 1 and 4 incorrectly, and another student answers only questions 3 and 7 incorrectly, they have the same percentage score. But it is not likely that the two students share the same latent trait values: questions 3 and 7 may have tested a different skill than items 1 and 4, and could vary greatly in difficulty level.
- This is where MIRT models are useful; they play a major role in educational measurement.
Multidimensional Logistic 2-Parameter (ML2P) model
- ML2P model gives the probability of students answering a particular question as a continuous function of student ability (McKinley and Reckase, 1980).
- ML2P model: the probability of student \(j\) with latent abilities \(\Theta_j = (\theta_{j1}, \ldots, \theta_{jK})^\top\) answering item \(i\) correctly is \(P(u_{ij} = 1 \mid \Theta_j; a_i, b_i) = \frac{1}{1 + \exp\left[-\sum_{k=1}^{K} a_{ik}\theta_{jk} + b_i\right]}\)
- Difficulty parameter \(b_i\) for item \(i\)
- Discrimination parameter \(a_{ik} \geq 0\) for each latent trait \(k\), quantifying the level of ability \(k\) required to answer item \(i\) correctly
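As a sanity check, the ML2P probability can be computed directly; a minimal sketch with illustrative values (ml2p_prob is a hypothetical helper, not a package function):

# P(u_ij = 1 | theta_j; a_i, b_i) = 1 / (1 + exp(-sum_k a_ik * theta_jk + b_i))
ml2p_prob <- function(theta, a, b) {
  1 / (1 + exp(-sum(a * theta) + b))
}

# Student abilities on 3 traits; the item loads on traits 1 and 3 (a_i2 = 0)
ml2p_prob(theta = c(0.5, -1.0, 0.2), a = c(1.2, 0, 0.8), b = 0.5)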
ML2P-VAE model
- No hidden layers in the decoder; instead, the non-zero weights in the decoder are determined by a given Q-matrix
- Next, connect the encoded distribution layer directly to the output layer. The output layer must use the sigmoidal activation function
\(Q(z_i) = \frac{1}{1 + e^{-z_i}}\)
- VAE model
[Converse et al.,2021]
- We could try inputting non-binary responses (partial credit) into the ML2P-VAE model, but it wouldn't have as much theory backing it. Samejima's graded response model would be a good fit.
Load data
Model parameters
num_items and num_skills describe the assessment length and the number of abilities evaluated by the assessment.
Q_matrix is a num_skills by num_items matrix which specifies the relationship between items and abilities (see the sketch below).
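A minimal sketch of setting these up from the bundled data; the hidden architecture c(16, 8) matches the encoder summary printed below, while the sigmoid activations are an assumption:

data(response)
data(q_matrix)

num_items <- ncol(response)    # 30 items on the assessment
num_skills <- nrow(q_matrix)   # 3 latent abilities
Q <- as.matrix(q_matrix)

enc_arch <- c(16L, 8L)   # two hidden encoder layers, as in the summary below
enc_act <- "sigmoid"     # assumed hidden-layer activation
out_act <- "sigmoid"     # ML2P requires a sigmoidal output activation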
ML2P-VAE model in R: code example
models_ind <- build_vae_independent(num_items,
                                    num_skills,
                                    Q,
                                    model_type = 2,
                                    enc_hid_arch = enc_arch,
                                    hid_enc_activation = enc_act,
                                    output_activation = out_act)
# build_vae_independent() returns the encoder, the decoder, and the full VAE
encoder_ind <- models_ind[[1]]
decoder_ind <- models_ind[[2]]
vae_ind <- models_ind[[3]]
encoder_ind
## Model
## Model: "model"
## ________________________________________________________________________________
## Layer (type) Output Shape Param # Connected to
## ================================================================================
## input (InputLayer) [(None, 30)] 0
## ________________________________________________________________________________
## hidden_1 (Dense) (None, 16) 496 input[0][0]
## ________________________________________________________________________________
## hidden_2 (Dense) (None, 8) 136 hidden_1[0][0]
## ________________________________________________________________________________
## z_mean (Dense) (None, 3) 27 hidden_2[0][0]
## ________________________________________________________________________________
## z_log_var (Dense) (None, 3) 27 hidden_2[0][0]
## ________________________________________________________________________________
## z (Concatenate) (None, 6) 0 z_mean[0][0]
## z_log_var[0][0]
## ________________________________________________________________________________
## lambda (Lambda) (None, 3) 0 z[0][0]
## ================================================================================
## Total params: 686
## Trainable params: 686
## Non-trainable params: 0
## ________________________________________________________________________________
decoder_ind
## Model
## Model: "model_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## latent_inputs (InputLayer) [(None, 3)] 0
## ________________________________________________________________________________
## vae_out (Dense) (None, 30) 120
## ================================================================================
## Total params: 120
## Trainable params: 120
## Non-trainable params: 0
## ________________________________________________________________________________
vae_ind
## Model
## Model: "model_2"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## input (InputLayer) [(None, 30)] 0
## ________________________________________________________________________________
## model (Model) [(None, 3), (None, 3), (None, 3 686
## ________________________________________________________________________________
## model_1 (Model) (None, 30) 120
## ================================================================================
## Total params: 806
## Trainable params: 806
## Non-trainable params: 0
## ________________________________________________________________________________
Training
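A minimal sketch of the training step, assuming the package's train_model() helper and its documented arguments (epoch count and batch size are illustrative):

history <- train_model(vae_ind,
                       train_data = as.matrix(response),
                       num_epochs = 10,
                       batch_size = 32,
                       validation_split = 0.15,
                       shuffle = TRUE,
                       verbose = 1)
plot(history)   # training/validation loss curves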
Model training loss

Get parameter estimates for Model after training
get_item_parameter_estimates() all trainable parameters of the decoder part of the VAE and returns the values which serve as estimates to the item paramters.
Load in true values (included in this pacakge)
Examine model estimates
The model was built with build_vae_independent(), which assumes that the latent traits are independent of one another.
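A minimal sketch of pulling the estimates out, assuming the package's get_item_parameter_estimates() and get_ability_parameter_estimates() helpers and its bundled true values:

# Item parameters are read off the trained decoder weights
item_est <- get_item_parameter_estimates(vae_ind, model_type = 2)
disc_est <- item_est[[1]]   # discrimination estimates (one per Q-matrix entry)
diff_est <- item_est[[2]]   # difficulty estimates (one per item)

# Ability estimates come from the encoder's latent means
theta_est <- get_ability_parameter_estimates(encoder_ind, as.matrix(response))[[1]]

# True values bundled with the package, for comparison
data(disc_true)
data(diff_true)
data(theta_true)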
Examine model estimates (continued…)
- Correlation plots of discrimination parameter estimates for data set (i) with 30 items and 3 latent traits.
- Each color represents discrimination parameters relating one of the 3 latent skills.

Difficulty parameters
- Difficulty parameters are on the item level, not the latent trait level, so each item \(i\) has exactly one difficulty parameter \(b_i\), regardless of the number of latent skills.

Ability parameter
- The three colors in the plot represent discrimination and ability parameters associated with the three latent traits

Correlated latent traits model
- Most effective when parameters specify a more complicated distribution, such as \(N(\mu, \Sigma)\) or \(N(0, \Sigma)\)
- Uses the TensorFlow Probability library (this did not work on my computer)
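A minimal sketch of the correlated variant, assuming the package's build_vae_correlated() (which relies on TensorFlow Probability) and its bundled correlation_matrix; argument names follow the package documentation and should be treated as an assumption:

data(correlation_matrix)

models_cor <- build_vae_correlated(num_items,
                                   num_skills,
                                   Q,
                                   correlation_matrix = as.matrix(correlation_matrix),
                                   enc_hid_arch = enc_arch,
                                   hid_enc_activation = enc_act,
                                   output_activation = out_act)
vae_cor <- models_cor[[3]]   # full VAE with a correlated latent prior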
VAE for educational assessment (summary)
- ML2P-VAE methods are most useful on high-dimensional data
- Faster runtime
- Preserves data privacy
- Estimates of students' latent abilities are often more accurate when using ML2P-VAE methods.
- This is especially interesting, as the estimates \(\Theta_j\) are not updated in the iterations of a gradient descent algorithm, while the estimates of \(a_{ik}\) and \(b_i\) are.
- Highly interpretable model.
- During estimation of difficulty parameters, the improvement gained from using traditional methods is incredibly small [Converse et al., 2021]
VAE/deep learning frontiers in educational assessment
Autoencoders for Educational Assessment
- Table 1 shows that for parameter recovery, the VAE has much smaller error terms and much higher correlations.
- This result is corroborated by the correlation plots between the true discrimination parameters and the weights of the decoder.

VAE for Cognitive Models
- A VAE model is proposed for educational cognitive assessment
- Establishes a relationship between well-known IRT models and the VAE
Extended classic IRT response models with deep neural network components/VAE
Feb 2020 study by Stanford
- Variational inference provides a potential solution, running orders of magnitude faster than MCMC algorithms while matching their state-of-the-art accuracy.
- This approach allows IRT to be naturally extended with non-linearities modeled via deep neural networks.
- A Bayesian inference algorithm for IRT that is shown to be fast and scalable without sacrificing accuracy
Incorporate a Q-matrix into an existing MIRT model
DOI: 10.1177/0013164418814898
- Introduces how to incorporate a Q-matrix into an existing MIRT model, and demonstrates the benefits of the proposed hybrid model through two simulation studies and an applied study.
VAE in the estimation of MIRT models
- Proposes an alternative network architecture and a regularization term in the loss function, which allow for correlated latent skills.
- The VAE can be fit to a more general multivariate Gaussian prior distribution with a non-identity correlation matrix.
- Experiments on four different datasets, both real and simulated and much larger than those used in Curi et al. (2019), demonstrate the ability of the proposed method to handle high-dimensional data.
- Uses the VAE to estimate item parameters and correlated latent abilities, and directly compares the ML2P-VAE method to more traditional parameter estimation methods, e.g., Monte Carlo expectation-maximization, displaying considerable gains in runtime.
Future directions
- Data anonymization via VAEs to protect learners' privacy
- Use of other distributions in the VAE model, e.g., exponential families
- Compare ML2P-VAE methods with sampling-based parameter estimation approaches, e.g., Gibbs sampling, Hamiltonian Monte Carlo, Metropolis-Hastings, and the Robbins-Monro algorithm
- Extend the ML2P-VAE method to an ML3P-VAE model