Q-matrix and variational autoencoders to estimate multidimensional item response theory models with correlated and independent latent variables
ML Reading Group - AGenCy Lab (Dhaka, Bangladesh)
Mahbubul Hasan
October 23, 2021
Plan for this lecture
- Data description
- Generative model, why generative model?
- Autoencoder, Variational Autoencoder (VAE)
- Q-matrix & how to use it in a VAE model?
- Multidimensional Logistic 2-Parameter (ML2P) Model
- VAE demo in R
- VAE/deep learning frontiers in educational assessment
- Future directions or extensions
Which face is fake?
[Image: MIT 6.S191]
Data
- ML2Pvae (Variational Autoencoder Models for IRT Parameter Estimation) R package (simulated data)
- data(response)
- 30-item exam which assesses 3 latent traits.
- Latent abilities for N = 5000 students, found in the data frame theta_true, were sampled from \(N(0, \Sigma)\).
- \(\Sigma\) specifies the correlations between the 3 abilities and is found in the data frame correlation_matrix.
- Discrimination parameters were sampled uniformly from \([0.25, 1.75]\), difficulty parameters from \([-3, 3]\), and Q-matrix entries from \(Bern(0.35)\).
- Probabilities for each student answering each question correctly were calculated with the ML2P model (McKinley and Reckase, 1980).
DataFrame (head)
- data(response)

- head(q_matrix)
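As a quick look, a minimal sketch in R (response and q_matrix are the datasets bundled with the ML2Pvae package):

library(ML2Pvae)

data(response)   # binary responses: 5000 students x 30 items
data(q_matrix)   # Q-matrix: 3 latent skills x 30 items

head(response)
head(q_matrix)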

Unsupervised learning
- Data: only the raw observations \(X\); no labels \(Y\) exist.
- Objective: explore and learn the hidden, underlying structure of the data
- Examples: dimensionality reduction, clustering
Generative model
- Generative means that once we learn the underlying structure of the data, we can create new examples that do not exist in the training set
- Goal: take as input training samples from some distribution and learn a model that represents that distribution
- The generative modeling process
[O'Reilly, 2021]
Why generative models?
- Facial classification, debiasing
- Able to uncover underlying latent variables in the dataset

- Outlier detection
- Detect outliers in the distribution and use them during training to improve the model even more!

[Amini,2019]
What is latent variable?
- Plato's myth of the cave:
- The shadows are the observed variables; the latent variables are the underlying objects generating the shadows
- The latent variables in our case (response data) are the knowledge components needed to solve the problems (i.e., the questions)
What is autoencoder?
- Autoencoders are a special class of neural networks in which the output is trained to reconstruct the input.
- An AE consists of two neural networks: an encoder and a decoder. The encoder takes the high-dimensional input \(x\) and feeds it forward through one or more hidden layers to some latent space.
- The decoder takes this latent representation and feeds it forward through hidden layers to output a reconstruction of the original input, \(\hat{x}\).
- AE model

[Rocca,2019]
Why do we care about a low-dimensional z?
- A lower-dimensional feature representation learned from unlabeled data
- Reduces noise; keeps the most meaningful features of the distribution
- Train the model to use these features to reconstruct the original data
- The bottleneck hidden layer
What is VAE?
- A VAE is a probabilistic twist on the autoencoder.
- VAE still consists of an encoder and decoder, but the latent space returned from the encoder is trained to learn a probability distribution for \(\Theta\) given \(X\).
- VAE allows the network to map the training data to a (latent) normal distribution.
- Sample from this distribution and feed that sample forward through the decoder
- VAE model (MNIST data)

VAE optimization
- Training VAE model
[Doersch, 2016]
- Left is without the “reparameterization trick”, and right is with it.
- Red shows sampling operations that are non-differentiable. Blue shows loss layers.
- The feed forward behavior of these networks is identical, but backpropagation can be applied only to the right network.
Reparametrizing the sampling layer
- Reparametrizing
[MIT 6.S191]
- Enables training the VAE end to end by backpropagation with respect to z and the actual weights of the encoder (see the sketch below)
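For intuition, a minimal sketch of this sampling layer in R with keras, mirroring the layer names and sizes in the encoder summary printed later in these slides (the sigmoid activations are illustrative; this is not the ML2Pvae package's internal code):

library(keras)

latent_dim <- 3  # number of latent traits in our setting

# A small encoder: 30 inputs -> 16 -> 8 -> (z_mean, z_log_var)
input <- layer_input(shape = 30, name = "input")
h <- input %>%
  layer_dense(units = 16, activation = "sigmoid", name = "hidden_1") %>%
  layer_dense(units = 8, activation = "sigmoid", name = "hidden_2")
z_mean <- h %>% layer_dense(units = latent_dim, name = "z_mean")
z_log_var <- h %>% layer_dense(units = latent_dim, name = "z_log_var")

# Reparameterization: z = mu + exp(z_log_var / 2) * epsilon, epsilon ~ N(0, I).
# All randomness is isolated in epsilon, so gradients can flow through
# z_mean and z_log_var back into the encoder weights.
sampling <- function(arg) {
  z_mean <- arg[, 1:latent_dim]
  z_log_var <- arg[, (latent_dim + 1):(2 * latent_dim)]
  epsilon <- k_random_normal(shape = c(k_shape(z_mean)[[1]]), mean = 0, stddev = 1)
  z_mean + k_exp(z_log_var / 2) * epsilon
}

z <- layer_concatenate(list(z_mean, z_log_var), name = "z") %>%
  layer_lambda(sampling, name = "lambda")

encoder <- keras_model(input, z)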
VAE: latent perturbation during training
- Perturbation: a small change in a system, e.g., as a result of a third object interacting with the system
- Slowly increase or decrease one latent variable while keeping the other variables fixed
- Disentanglement

VAE summary
- Compress a representation of the information into something we can use to learn
- Reconstruction allows for unsupervised learning (no labels Y; the input itself is the target)
- Reparametrization trick to train end to end
- Use latent perturbation to interpret the hidden layers
- Generate new data/examples
What is a Q-matrix and what can we do with it?
- Shows the relationship between test items and latent or underlying attributes, or concepts.
- The Q-matrix is an M×N matrix, where M equals the number of questions in an assessment and N equals the total number of concepts required for understanding all questions.

- The Q-matrix can be used for understanding students' performance
- By assigning the closest ideal response to a student's response vector, one can infer which concepts the student knows and which they do not (see the toy example below)
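For instance, a toy Q-matrix for a 4-question assessment over 3 concepts might look like this (entries are illustrative; 1 means the question requires the concept):

# Rows = questions (M = 4), columns = concepts (N = 3)
Q_toy <- matrix(c(1, 0, 0,
                  1, 1, 0,
                  0, 1, 0,
                  0, 0, 1),
                nrow = 4, byrow = TRUE)
rownames(Q_toy) <- paste0("item_", 1:4)
colnames(Q_toy) <- paste0("concept_", 1:3)
Q_toy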
Q-matrix & how to use it in a VAE model?
- AE and VAE architectures are combined with the ML2P model in the following ways:
- No hidden layers in the decoder; a sigmoidal activation function on the output layer nodes (with non-negative weights).
- The Q-matrix determines the connections between the latent traits and the output items.
- VAE model
[Converse et al.,2021]
What are the difficulty and discrimination parameters?
- Reflect students' varied levels of understanding
- difficulty

- Difficulty parameters can be any real number, but usually lie in \([-3, 3]\)
Discrimination parameters
- Be fair: students with better understanding should do better on the test
- discrimination

- Discrimination parameters are always positive, usually in \([0.25, 3]\)
Multidimensional item response theory (MIRT)
- A naive approach to quantifying student knowledge is to look at the percentage of questions that the student answered correctly
- This does not take into account the fact that each item on the assessment is different, both in difficulty and in content
For example, if one student answers only questions 1 and 4 incorrectly, and another student answers only questions 3 and 7 incorrectly, they have the same percentage score. But it is not likely that the two students share the same latent trait values: questions 3 and 7 may have tested a different skill than items 1 and 4, and could vary greatly in difficulty level.
- This is where MIRT models are useful; they play a major role in educational measurement.
Multidimensional Logistic 2-Parameter (ML2P) model
- ML2P model gives the probability of students answering a particular question as a continuous function of student ability (McKinley and Reckase, 1980).
- ML2P model: the probability of student \(j\) with latent abilities \(\Theta_j = (\theta_{j1}, \ldots, \theta_{jK})^\top\) answering item \(i\) correctly is \(P(u_{ij} = 1 \mid \Theta_j; a_i, b_i) = \frac{1}{1 + \exp\left[-\sum_{k=1}^{K} a_{ik}\theta_{jk} + b_i\right]}\)
- Difficulty parameter \(b_i\) for item \(i\)
- Discrimination parameter \(a_{ik} \geq 0\) for each latent trait \(k\), quantifying the level of ability \(k\) required to answer item \(i\) correctly
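As a sanity check, the ML2P probability can be computed directly; a minimal sketch with illustrative values (ml2p_prob is a hypothetical helper, not a package function):

# P(u_ij = 1 | theta_j; a_i, b_i) = 1 / (1 + exp(-sum_k a_ik * theta_jk + b_i))
ml2p_prob <- function(theta, a, b) {
  1 / (1 + exp(-sum(a * theta) + b))
}

# Student abilities on 3 traits; the item loads on traits 1 and 3 (a_i2 = 0)
ml2p_prob(theta = c(0.5, -1.0, 0.2), a = c(1.2, 0, 0.8), b = 0.5)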
ML2P-VAE model
- No hidden layers in the decoder; instead, the non-zero weights in the decoder are determined by a given Q-matrix
- Next, connect the encoded distribution layer directly to the output layer. The output layer must use the sigmoidal activation function
\(Q(z_i) = \frac{1}{1 + e^{-z_i}}\)
- VAE model
[Converse et al.,2021]
- We could try inputting non-binary responses (partial credit) into the ML2P-VAE model, but it wouldn't have as much theory backing it. Samejima's graded response model would be a good fit.
Load data
Model parameters
num_items and num_skills describe the assessment length and the number of abilities evaluated by the assessment.
Q_matrix is a num_skills by num_items matrix which specifies the relationship between items and abilities (see the sketch below).
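A minimal sketch of setting these up from the bundled data; the hidden architecture c(16, 8) matches the encoder summary printed below, while the sigmoid activations are an assumption:

data(response)
data(q_matrix)

num_items <- ncol(response)    # 30 items on the assessment
num_skills <- nrow(q_matrix)   # 3 latent abilities
Q <- as.matrix(q_matrix)

enc_arch <- c(16L, 8L)   # two hidden encoder layers, as in the summary below
enc_act <- "sigmoid"     # assumed hidden-layer activation
out_act <- "sigmoid"     # ML2P requires a sigmoidal output activation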
ML2P-VAE model in R: code example
models_ind <- build_vae_independent(num_items,
                                    num_skills,
                                    Q,
                                    model_type = 2,
                                    enc_hid_arch = enc_arch,
                                    hid_enc_activation = enc_act,
                                    output_activation = out_act)
# build_vae_independent() returns the encoder, the decoder, and the full VAE
encoder_ind <- models_ind[[1]]
decoder_ind <- models_ind[[2]]
vae_ind <- models_ind[[3]]
encoder_ind
## Model
## Model: "model"
## ________________________________________________________________________________
## Layer (type) Output Shape Param # Connected to
## ================================================================================
## input (InputLayer) [(None, 30)] 0
## ________________________________________________________________________________
## hidden_1 (Dense) (None, 16) 496 input[0][0]
## ________________________________________________________________________________
## hidden_2 (Dense) (None, 8) 136 hidden_1[0][0]
## ________________________________________________________________________________
## z_mean (Dense) (None, 3) 27 hidden_2[0][0]
## ________________________________________________________________________________
## z_log_var (Dense) (None, 3) 27 hidden_2[0][0]
## ________________________________________________________________________________
## z (Concatenate) (None, 6) 0 z_mean[0][0]
## z_log_var[0][0]
## ________________________________________________________________________________
## lambda (Lambda) (None, 3) 0 z[0][0]
## ================================================================================
## Total params: 686
## Trainable params: 686
## Non-trainable params: 0
## ________________________________________________________________________________
decoder_ind
## Model
## Model: "model_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## latent_inputs (InputLayer) [(None, 3)] 0
## ________________________________________________________________________________
## vae_out (Dense) (None, 30) 120
## ================================================================================
## Total params: 120
## Trainable params: 120
## Non-trainable params: 0
## ________________________________________________________________________________
vae_ind
## Model
## Model: "model_2"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## input (InputLayer) [(None, 30)] 0
## ________________________________________________________________________________
## model (Model) [(None, 3), (None, 3), (None, 3 686
## ________________________________________________________________________________
## model_1 (Model) (None, 30) 120
## ================================================================================
## Total params: 806
## Trainable params: 806
## Non-trainable params: 0
## ________________________________________________________________________________
Training
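A minimal sketch of the training step, assuming the package's train_model() helper and its documented arguments (epoch count and batch size are illustrative):

history <- train_model(vae_ind,
                       train_data = as.matrix(response),
                       num_epochs = 10,
                       batch_size = 32,
                       validation_split = 0.15,
                       shuffle = TRUE,
                       verbose = 1)
plot(history)   # training/validation loss curves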
Model training loss

Get parameter estimates for Model after training
get_item_parameter_estimates() all trainable parameters of the decoder part of the VAE and returns the values which serve as estimates to the item paramters.
Load in true values (included in this pacakge)
Examine model estimates
The model was built with build_vae_independent(), which assumes that the latent traits are independent of one another.
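A minimal sketch of pulling the estimates out, assuming the package's get_item_parameter_estimates() and get_ability_parameter_estimates() helpers and its bundled true values:

# Item parameters are read off the trained decoder weights
item_est <- get_item_parameter_estimates(vae_ind, model_type = 2)
disc_est <- item_est[[1]]   # discrimination estimates (one per Q-matrix entry)
diff_est <- item_est[[2]]   # difficulty estimates (one per item)

# Ability estimates come from the encoder's latent means
theta_est <- get_ability_parameter_estimates(encoder_ind, as.matrix(response))[[1]]

# True values bundled with the package, for comparison
data(disc_true)
data(diff_true)
data(theta_true)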
Examine model estimates (continued…)
- Correlation plots of discrimination parameter estimates for data set (i) with 30 items and 3 latent traits.
- Each color represents discrimination parameters relating one of the 3 latent skills.

Difficulty parameters
- Difficulty parameters are on the item level, not the latent trait level, so each item \(i\) has exactly one difficulty parameter \(b_i\), regardless of the number of latent skills.

Ability parameter
- The three colors in the plot represent discrimination and ability parameters associated with the three latent traits

Correlated latent traits model
- Most effective when parameters specify a more complicated distribution, such as \(N(\mu, \Sigma)\) or \(N(0, \Sigma)\)
- Uses the TensorFlow Probability library (this did not work on my computer)
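A minimal sketch of the correlated variant, assuming the package's build_vae_correlated() (which relies on TensorFlow Probability) and its bundled correlation_matrix; argument names follow the package documentation and should be treated as an assumption:

data(correlation_matrix)

models_cor <- build_vae_correlated(num_items,
                                   num_skills,
                                   Q,
                                   correlation_matrix = as.matrix(correlation_matrix),
                                   enc_hid_arch = enc_arch,
                                   hid_enc_activation = enc_act,
                                   output_activation = out_act)
vae_cor <- models_cor[[3]]   # full VAE with a correlated latent prior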
VAE for educational assessment (summary)
- ML2P-VAE methods are most useful on high-dimensional data
- Faster runtime
- Preserves data privacy
- Estimates of students' latent abilities are often more accurate when using ML2P-VAE methods.
- This is especially interesting, as the estimates \(\Theta_j\) are not updated in the iterations of a gradient descent algorithm, while the estimates of \(a_{ik}\) and \(b_i\) are.
- Highly interpretable model.
- During estimation of difficulty parameters, the improvement gained from using traditional methods is incredibly small [Converse et al., 2021]
VAE/deep learning frontiers in educational assessment
Autoencoders for Educational Assessment
- Table 1 shows that for parameter recovery, the VAE has much smaller error terms and much higher correlations.
- This result is corroborated by the correlation plots between the true discrimination parameters and the weights of the decoder.

VAE for Cognitive Models
- A VAE model is proposed for educational cognitive assessment
- Establishes a relationship between well-known IRT models and the VAE
Extended classic IRT response models with deep neural network components/VAE
Feb 2020 study by Stanford
- Variational inference provides a potential solution, running orders of magnitude faster than MCMC algorithms while matching their state-of-the-art accuracy.
- This approach allows IRT to be naturally extended with non-linearities modeled via deep neural networks.
- A Bayesian inference algorithm for IRT that is shown to be fast and scalable without sacrificing accuracy
Incorporate a Q-matrix into an existing MIRT model
DOI: 10.1177/0013164418814898
- Introduces how to incorporate a Q-matrix into an existing MIRT model, and demonstrates the benefits of the proposed hybrid model through two simulation studies and an applied study.
VAE in the estimation of MIRT models
- Proposes an alternative network architecture and a regularization term in the loss function, which allow for correlated latent skills.
- The VAE can be fit to a more general multivariate Gaussian prior distribution with a non-identity correlation matrix.
- Experiments on four different datasets, both real and simulated and much larger than those used in Curi et al. (2019), demonstrate the ability of the proposed method to handle high-dimensional data.
- Uses the VAE to estimate item parameters and correlated latent abilities, and directly compares the ML2P-VAE method to more traditional parameter estimation methods, e.g., Monte Carlo expectation-maximization, displaying considerable gains in runtime.
Future directions
- Data anonymization via VAEs to protect learners' privacy
- Use of other distributions in the VAE model, e.g., exponential families
- Compare ML2P-VAE methods with sampling-based parameter estimation approaches, e.g., Gibbs sampling, Hamiltonian Monte Carlo, Metropolis-Hastings, and the Robbins-Monro algorithm
- Extend the ML2P-VAE method to an ML3P-VAE model