Data

Simulated data set included in the ML2Pvae (Variational Autoencoder Models for IRT Parameter Estimation) R package.

A 30-item exam that assesses 3 latent traits.
The latent abilities for 5000 students, found in the data frame theta_true, were sampled from \(N(0,Σ)\).
\(Σ\) specifies the correlations between the 3 abilities, and is found in the data frame correlation_matrix.

Discrimination and difficulty parameters were sampled uniformly from \([0.25, 1.75]\) and \([−3, 3]\) respectively, and entries in the Q-matrix were sampled from \(Bern(0.35)\).

These values can be found in the data frames disc_true, diff_true, and q_matrix.

Probabilities for each student answering each question correctly were calculated with the ML2P model (McKinley and Reckase, 1980).
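To make the simulation recipe concrete, here is a minimal base-R sketch of the steps described above. It is not the package's own generating code: the correlation matrix `Sigma` is a made-up stand-in for correlation_matrix, the variable names are illustrative, and the sign convention in the ML2P formula (minus on the ability term, plus on the difficulty) is an assumption.

```r
# Base-R sketch of the simulation described above (illustrative, not package code).
set.seed(1)
num_students <- 5000
num_items <- 30
num_skills <- 3

# Hypothetical correlation matrix; the package ships its own correlation_matrix.
Sigma <- matrix(c(1.0, 0.3, 0.5,
                  0.3, 1.0, 0.2,
                  0.5, 0.2, 1.0), nrow = 3, byrow = TRUE)

# theta ~ N(0, Sigma): standard normal draws times the upper Cholesky factor.
theta <- matrix(rnorm(num_students * num_skills), ncol = num_skills) %*% chol(Sigma)

# Q-matrix (skills x items) with Bernoulli(0.35) entries.
Q <- matrix(rbinom(num_skills * num_items, 1, 0.35), nrow = num_skills)

# Discrimination is uniform on [0.25, 1.75] and zeroed where Q is 0.
disc <- matrix(runif(num_skills * num_items, 0.25, 1.75), nrow = num_skills) * Q
difficulty <- runif(num_items, -3, 3)

# Assumed ML2P form: P = 1 / (1 + exp(-sum_j a_ji * theta_kj + b_i)).
logits <- sweep(theta %*% disc, 2, difficulty, "-")
prob <- 1 / (1 + exp(-logits))

# Draw the binary response matrix (students x items).
responses <- matrix(rbinom(length(prob), 1, prob), nrow = num_students)
```

With Bernoulli(0.35) entries, some items may end up requiring no skill at all; their success probability then depends only on the difficulty.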

Unsupervised Learning

Data: Y is the data, no labels exist.
Objective: to explore and learn hidden and underlying structure of the data
Example: dimensionality reduction, clustering

Generative Model

Generative means that once we learn the underlying structure of the data, we can create a new example that does not exist in the data.
Goal: take as input training samples from some distribution and learn a model that represents that distribution.
The generative modeling process [O'Reilly, 2021]

Why generative models?

Facial classification, debiasing
Able to uncover underlying latent variables in the dataset

Outlier detection
Detect outliers in the distribution and use them during training to improve the model further.

[Amini,2019]

What is a latent variable?

Myth of the cave:

Shadows are the observed variables, and latent variables are the underlying objects generating the shadows.

Autoencoder

Autoencoders are a special class of neural networks in which the input and output layers have the same dimension.

An AE consists of two neural networks: an encoder and a decoder. The encoder takes the high-dimensional input \(x\), and feeds forward through one or more hidden layers to some latent space.

The decoder takes in this latent representation and feeds forward through hidden layers to output a reconstruction of the original input, \(\hat{x}\).

AE model

[Rocca,2019]

Why do we care about low-dimensional `Z`?

Lower-dimensional feature representation from unlabeled data
Reduce noise, use the most meaningful features from the distribution
Train the model to use these features to reconstruct the original data.
Bottleneck hidden layers

What is VAE?

A VAE is a probabilistic twist on the autoencoder.
A VAE still consists of an encoder and decoder, but the latent space returned from the encoder is trained to learn a probability distribution for \(\Theta\) given \(X\).
A VAE allows the network to map the training data to a (latent) normal distribution.

Sample from this distribution, and feed that sample forward through the decoder.

VAE optimization

Training VAE model [Doersch, 2016]

Left is without the “reparameterization trick”, and right is with it.
Red shows sampling operations that are non-differentiable. Blue shows loss layers.
The feed forward behavior of these networks is identical, but backpropagation can be applied only to the right network.

Reparameterizing the sampling layer

Reparameterizing [MIT-S191]
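As a small illustration of the trick, the base-R sketch below writes a latent sample as a deterministic function of the encoder outputs plus external noise. The values of `mu` and `log_var` are hypothetical encoder outputs, and `reparameterize` is an illustrative helper, not a function from any package.

```r
set.seed(42)
mu <- c(0.5, -1.0, 0.2)        # hypothetical encoder output: means
log_var <- c(-0.5, 0.1, -1.2)  # hypothetical encoder output: log variances

# Direct sampling, z <- rnorm(3, mean = mu, sd = exp(0.5 * log_var)), hides the
# dependence on mu and log_var inside a random draw.
# Reparameterized sampling moves the randomness into eps ~ N(0, I), so z is a
# deterministic (and differentiable) function of mu and log_var.
reparameterize <- function(mu, log_var, eps = rnorm(length(mu))) {
  mu + exp(0.5 * log_var) * eps
}

eps <- rnorm(3)
z <- reparameterize(mu, log_var, eps)

# For a fixed eps the gradients are simple:
#   dz / dmu      = 1
#   dz / dlog_var = 0.5 * exp(0.5 * log_var) * eps
grad_log_var <- 0.5 * exp(0.5 * log_var) * eps
```

Because `eps` does not depend on any network parameter, backpropagation can pass through `z` to the encoder.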

VAE summary

Compress the representation of information into something we can use to learn.
Reconstruction allows for unsupervised learning (no labels in the data, no Y).
The reparameterization trick makes it possible to train end to end.
Use perturbation to interpret the hidden layers.
Generate new data/examples.

VAE for Educational Assessment

The multidimensional logistic 2-parameter (ML2P) model is designed to test a subject's performance in \(J\) latent traits on an assessment with \(I\) items. Denote \(\Theta_k = (\theta_{k1},...,\theta_{kJ})\) as the set of latent skills of subject \(k\).
VAE model [Converse et al., 2021]

Larger values of \(\theta_{kj}\) represent a higher knowledge of skill \(j\).

VAE for Educational Assessment (continued)

The binary variable \(X_{ki}\) denotes student \(k\)'s answer to item \(i\).
The ML2P model defines the probability of success as

\(P(X_{ki} = 1 \mid \Theta_k)=\frac{1}{1+\exp\Big(-\sum_{j=1}^J a_{ji}\theta_{kj} + b_i\Big)}\)

with discrimination parameters \(a_{ji}\) measuring the relation of latent skill \(j\) with assessment item \(i\), and difficulty parameter \(b_i\), for \(i = 1, ..., I\) and \(j = 1, ..., J\).
Assume that \(a_{ji} ≥ 0\) for all \(i, j\); intuitively, this means that more skill in a latent area cannot decrease the probability of answering an item correctly.
If \(a_{ji} = 0\), then item \(i\) does not require skill \(j\) to be answered correctly.

A Q-matrix keeps track of which latent skills each item requires.
Define \(Q \in \mathbb{R}^{J×I}\) by \(Q_{ji} = 1\) if item \(i\) requires skill \(j\), and \(Q_{ji} = 0\) otherwise.
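A tiny worked example may help show how the Q-matrix and the non-negativity assumption interact. The sketch below assumes the ML2P form with a minus sign on the ability term, so that a higher ability raises the success probability; `ml2p_prob` and all numeric values are illustrative and not taken from the package.

```r
# Success probabilities for one student under the assumed ML2P form
# P = 1 / (1 + exp(-sum_j a_ji * theta_kj + b_i)).
ml2p_prob <- function(theta, a, b) {
  # theta: length-J abilities; a: J x I discriminations; b: length-I difficulties
  z <- as.vector(theta %*% a) - b
  1 / (1 + exp(-z))
}

# Two skills, two items. Item 2 does not require skill 1 (Q[1, 2] = 0).
Q <- matrix(c(1, 0,
              1, 1), nrow = 2, byrow = TRUE)
a <- matrix(c(1.2, 0.9,
              0.7, 1.5), nrow = 2, byrow = TRUE) * Q  # mask by Q
b <- c(0.0, 0.5)

p_low  <- ml2p_prob(c(0, 0), a, b)  # baseline student
p_high <- ml2p_prob(c(1, 0), a, b)  # same student with more of skill 1 only
```

Raising skill 1 increases the probability for item 1 and leaves item 2 unchanged, which is exactly what \(a_{ji} ≥ 0\) and \(Q_{ji} = 0\) are meant to express.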

What is a Q-matrix, and what can we do with it?

It shows the relationship between test items and latent (underlying) attributes or concepts.

In this description, the Q-matrix is an M×N matrix, where M equals the number of questions in an assessment and N equals the total number of concepts required for understanding all questions (the transpose of the skills-by-items convention used in the ML2P definition).

A Q-matrix can be used for understanding students' performance:
by assigning the closest ideal response to a student's response vector, one can infer which concepts the student knows and which they do not.
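The "closest ideal response" idea can be sketched in a few lines of base R. The example assumes a conjunctive rule (a question is answered correctly only if every concept it requires is known) and Hamming distance as the notion of closeness; neither choice is stated in the slides, and the 3×2 Q-matrix is made up.

```r
# Questions x concepts Q-matrix (hypothetical): question 3 needs both concepts.
Q <- matrix(c(1, 0,
              0, 1,
              1, 1), nrow = 3, byrow = TRUE)

# All 2^N knowledge states (1 = concept known).
states <- as.matrix(expand.grid(rep(list(0:1), ncol(Q))))

# Ideal response under the assumed conjunctive rule: correct only if no
# required concept is missing.
ideal_response <- function(state, Q) as.integer(Q %*% (1 - state) == 0)
ideal <- t(apply(states, 1, ideal_response, Q = Q))

# Assign the knowledge state whose ideal response is nearest in Hamming distance.
# Ties are broken by taking the first minimum.
closest_state <- function(response, ideal, states) {
  d <- rowSums(abs(sweep(ideal, 2, response, "-")))
  unname(states[which.min(d), ])
}

inferred <- closest_state(c(1, 0, 0), ideal, states)
```

Here a student who answers only question 1 correctly is assigned the state in which concept 1 is known and concept 2 is not.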

Q-matrix and how to use it in the VAE model

AE and VAE architectures are combined with the ML2P model in the following ways:

There is no hidden layer in the decoder, and the output layer nodes use a sigmoidal activation function (with non-negative weights).
The Q-matrix determines the connections between the latent traits and the output items.

VAE model [Converse et al.,2021]

What are the difficulty and discrimination parameters?

Difficulty
Reflects students' varied levels of understanding.

Discrimination
Fairness: students with better understanding should do better on the test.

Multidimensional item response theory (MIRT)

In MIRT, it is assumed that each student j possesses K latent abilities \(𝛩_j = (𝜃_{j1},...,𝜃_{jK})⊤\), which are not directly observable.
In real-world applications, the available data is often student j’s response set to an exam with n items, represented as a binary vector \(u_j ∈ R^n\).
An entry in \(u_j\) equal to 1 corresponds with a correct answer, and an entry equal to 0 corresponds with an incorrect answer to a particular item of the exam.
Given student j’s assessment results \(u_j\), the goal in IRT parameter estimation is to infer the latent traits \(𝛩_j\).

MIRT


For example, if one student answers only questions 1 and 4 incorrectly, and another student answers only questions 3 and 7 incorrectly, they have the same percentage score. But it is not likely that the two students share the same latent trait values. Questions 3 and 7 may have tested a different skill than questions 1 and 4, and could vary greatly in difficulty level.

This is where MIRT models are useful, and why they play a major role in educational measurement.

Multidimensional Logistic 2-Parameter (ML2P) Model

ML2P model gives the probability of students answering a particular question as a continuous function of student ability (McKinley and Reckase, 1980).
There are two types of parameters associated with each item: a difficulty parameter \(b_i\) for item \(i\), and a discrimination parameter \(a_{ik} ≥ 0\) for each latent trait \(k\), quantifying the level of ability \(k\) required to answer item \(i\) correctly.
The ML2P model gives the probability of student \(j\) with latent abilities \(𝛩_j = (𝜃_{j1},...,𝜃_{jK})^⊤\) answering item \(i\) correctly as

\(P(u_{ij} = 1 \mid 𝛩_j) = \frac{1}{1+\exp\Big(-\sum_{k=1}^K a_{ik}𝜃_{jk} + b_i\Big)}\)

where \(u_{ij}\) is the \(i\)-th entry of \(u_j\).

ML2P-VAE model

First, there are no hidden layers in the decoder. Instead, the non-zero weights in the decoder are determined by a given Q-matrix.
These connect the encoded distribution layer directly to the output layer. The output layer must use the sigmoidal activation function:

\(\sigma(z_i) = \frac{1}{1+e^{-z_i}}\)

where \(z_i = \sum_{k=1}^K w_{ki}\alpha_k + b_i\). Here, \(w_{ki}\) is the weight between the \(k\)-th node in the second-to-last layer and the \(i\)-th node in the output layer, \(\alpha_k\) is the activation of the \(k\)-th node in the second-to-last layer, and \(b_i\) is the bias value of the \(i\)-th node in the output layer.
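The decoder described above amounts to a single masked dense layer, which can be sketched in base R as follows. This is an illustration of the idea only, not the package's Keras implementation; `decoder_forward`, the clipping of weights at zero, and all numeric values are assumptions made for the example.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# One-layer decoder: latent activations -> item success probabilities.
decoder_forward <- function(alpha, W, b, Q) {
  # alpha: length-K latent activations; W: K x I weights
  # Q: K x I binary mask; b: length-I output biases
  W_masked <- pmax(W, 0) * Q  # keep weights non-negative and zero where Q is 0
  sigmoid(as.vector(alpha %*% W_masked) + b)
}

# Hypothetical example with K = 3 latent traits and I = 4 items.
Q <- matrix(c(1, 0, 1, 0,
              0, 1, 1, 0,
              0, 0, 1, 0), nrow = 3, byrow = TRUE)
W <- matrix(c( 1.0, 0.4, 0.8, 0.3,
               0.2, 1.1, 0.5, 0.9,
              -0.6, 0.7, 1.3, 0.1), nrow = 3, byrow = TRUE)
b <- c(0.5, -0.2, 0.0, 1.0)
alpha <- c(0.3, -1.2, 0.8)

p <- decoder_forward(alpha, W, b, Q)
```

Item 4 has an all-zero Q column in this example, so its output depends only on its bias. Under this reading, the masked weights play the role of discrimination estimates and the biases relate to the difficulty estimates, up to the sign convention used for \(b_i\).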

VAE model [Converse et al.,2021]

Model parameters

num_items and num_skills describe the assessment length and the number of abilities being evaluated by the assessment.
Q_matrix is a num_skills by num_items matrix which specifies the relationship between items and abilities.

Q-matrix specifies the nonzero weights in the decoder of the VAE, which allows interpretation of network parameters as discrimination/difficulty parameter estimates and the hidden encoded layer as ability parameter estimates.

Without the Q_matrix, estimation of these parameters would not be possible.

Model 1: An ML2P-VAE model assuming latent traits are independent

After the hidden layers, the encoder outputs the mean and log variance of the latent distribution. Next, a z layer samples from the distribution defined by those two parameters, and the decoder uses that sample to reconstruct the input responses.
model_type: can be either 1 or 2

enc_hid_arch: a vector of integers whose length determines the number of hidden layers in the encoder, and whose entries describe the number of nodes in the hidden layers.
The default is one hidden layer in the encoder, with size halfway between num_items (the size of the input layer) and num_skills (the dimension of the encoded distribution).
hid_enc_activations: a vector of strings that specifies the activation function to be used in each encoder hidden layer.

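# Build the independent-traits ML2P-VAE; the result is a list of three Keras models.
models_ind <- build_vae_independent(num_items,
                                    num_skills,
                                    Q,
                                    model_type = 2,
                                    enc_hid_arch = enc_arch,
                                    hid_enc_activation = enc_act,
                                    output_activation = out_act)
encoder_ind <- models_ind[[1]]  # maps response vectors to latent trait estimates
decoder_ind <- models_ind[[2]]  # single output layer holding the item parameters
vae_ind <- models_ind[[3]]      # full model, used for training
encoder_ind                     # print the encoder summary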

## Model
## Model: "model"
## ________________________________________________________________________________
## Layer (type)              Output Shape      Param #  Connected to               
## ================================================================================
## input (InputLayer)        [(None, 30)]      0                                   
## ________________________________________________________________________________
## hidden_1 (Dense)          (None, 16)        496      input[0][0]                
## ________________________________________________________________________________
## hidden_2 (Dense)          (None, 8)         136      hidden_1[0][0]             
## ________________________________________________________________________________
## z_mean (Dense)            (None, 3)         27       hidden_2[0][0]             
## ________________________________________________________________________________
## z_log_var (Dense)         (None, 3)         27       hidden_2[0][0]             
## ________________________________________________________________________________
## z (Concatenate)           (None, 6)         0        z_mean[0][0]               
##                                                      z_log_var[0][0]            
## ________________________________________________________________________________
## lambda (Lambda)           (None, 3)         0        z[0][0]                    
## ================================================================================
## Total params: 686
## Trainable params: 686
## Non-trainable params: 0
## ________________________________________________________________________________

decoder_ind

## Model
## Model: "model_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## latent_inputs (InputLayer)          [(None, 3)]                     0           
## ________________________________________________________________________________
## vae_out (Dense)                     (None, 30)                      120         
## ================================================================================
## Total params: 120
## Trainable params: 120
## Non-trainable params: 0
## ________________________________________________________________________________

vae_ind

## Model
## Model: "model_2"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## input (InputLayer)                  [(None, 30)]                    0           
## ________________________________________________________________________________
## model (Model)                       [(None, 3), (None, 3), (None, 3 686         
## ________________________________________________________________________________
## model_1 (Model)                     (None, 30)                      120         
## ================================================================================
## Total params: 806
## Trainable params: 806
## Non-trainable params: 0
## ________________________________________________________________________________
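The parameter counts in the three summaries can be checked by hand: a dense layer with n_in inputs and n_out outputs has n_in × n_out weights plus n_out biases. The base-R sketch below reproduces the numbers; `dense_params` is an illustrative helper, and the layer sizes are read off the printed output shapes.

```r
# Parameters of a fully connected layer: weights plus biases.
dense_params <- function(n_in, n_out) n_in * n_out + n_out

encoder_counts <- c(
  hidden_1  = dense_params(30, 16),  # 496
  hidden_2  = dense_params(16, 8),   # 136
  z_mean    = dense_params(8, 3),    # 27
  z_log_var = dense_params(8, 3)     # 27
)
decoder_counts <- c(vae_out = dense_params(3, 30))  # 120

encoder_total <- sum(encoder_counts)          # 686
decoder_total <- sum(decoder_counts)          # 120
vae_total <- encoder_total + decoder_total    # 806
```

The decoder count of 120 is the full 3 × 30 weight matrix plus 30 biases, so it appears to include the weights that the Q-matrix holds at zero.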

Training

Inputs to train_model() include a compiled Keras model, which needs to have an autoencoder or VAE structure where the input and output layers have the same dimension. This can be obtained from the third item returned from build_vae_independent() or build_vae_correlated().
The argument train_data is a num_train by num_items binary matrix, where entry \((j, i) = 1\) if student \(j\) answered item \(i\) correctly, and 0 otherwise.
The number of loops over the training data is set by num_epochs.
The default batch_size is set to be 1 (pure stochastic gradient descent), but can be any integer less than or equal to num_train.

# Training parameters
num_train <- floor(0.8 * num_students)
num_test <- num_students - num_train 
data_train <- data[1:num_train,]
data_test <- data[(num_train+1):num_students,]
num_epochs <- 30
batch_size <- 16

Model 1 Training Loss

Get parameter estimates for Model 1 after training

The function get_item_parameter_estimates() looks at all trainable parameters of the decoder part of the VAE and returns the values that serve as estimates of the item parameters.

Only one argument is required, the decoder, which takes in a Keras model. This can be obtained from the second item returned by build_vae_independent() or build_vae_correlated().
model_type defaults to the 2-parameter logistic model, but can be set to 1 for the 1-parameter logistic model.

get_item_parameter_estimates() returns the output layer biases and the weights connected to the output layer, which serve as estimates of the difficulty and discrimination parameters, respectively.
If model_type = 1, then only the output layer biases (difficulty parameter estimates) are returned.

Converting Response Sets into Ability Parameter Estimates

The encoder of a trained ML2P-VAE model maps students' responses to estimates of their ability parameters.

Load in true values (included in this package)

Examine Model 1 estimates

Model 1 was built with build_vae_independent(), which assumes that the latent traits are independent of one another. build_vae_independent() can be useful if the true correlations between latent traits are not known, i.e. correlation_matrix is not given.

Examine Model 1 estimates (continued)

Correlation plots of discrimination parameter estimates for data set (i) with 30 items and 3 latent traits.
Each color represents discrimination parameters relating one of the 3 latent skills.
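Alongside the correlation plots, parameter recovery is often summarized with a few numbers. The sketch below computes RMSE, bias, and correlation between estimates and true values in base R; `recovery_metrics` is an illustrative helper, and the "estimates" are simulated noise around made-up true values, not output from the trained model.

```r
# Summary measures of how well estimates recover the true parameters.
recovery_metrics <- function(estimate, truth) {
  err <- as.vector(estimate) - as.vector(truth)
  c(rmse = sqrt(mean(err^2)),
    bias = mean(err),
    cor  = cor(as.vector(estimate), as.vector(truth)))
}

set.seed(7)
truth <- runif(30, -3, 3)                # stand-in for true difficulties
estimate <- truth + rnorm(30, sd = 0.3)  # stand-in for noisy estimates
metrics <- recovery_metrics(estimate, truth)
```

For discrimination parameters it is probably more informative to compare only the entries where the Q-matrix is 1, since the remaining entries are fixed at zero in both the truth and the estimates.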

Correlated latent traits model

Uses the TensorFlow Probability library (this is not working on my computer).

Correlated latent traits

VAE/deep learning frontiers in educational assessment

Autoencoders for Educational Assessment

https://doi.org/10.1007/978-3-030-23207-8_8

Table 1 shows that for parameter recovery, the VAE has much smaller error terms and much higher correlations.
This result is corroborated by the correlation plots between the true discrimination parameters and the weights of the decoder.

VAE for Cognitive Models

DOI:10.1109/IJCNN.2019.8852333

A VAE model is proposed for educational cognitive assessment.
Establishes a relationship between well-known IRT models and the VAE.

Extended classic IRT response models with deep neural network components/VAE

Feb 2020 study by Stanford

https://arxiv.org/abs/2002.00276

Variational inference provides a potential solution, running orders of magnitude faster than MCMC algorithms while matching their state-of-the-art accuracy.

This approach allows IRT to be extended naturally with non-linearities modeled via deep neural networks.

Presents a Bayesian inference algorithm for IRT and shows that it is fast and scalable without sacrificing accuracy.

Incorporate a Q-matrix into an existing MIRT model

DOI: 10.1177/0013164418814898

Introduced how to incorporate a Q-matrix into an existing MIRT model, and demonstrated benefits of the proposed hybrid model through two simulation studies and an applied study.


VAE in the estimation of MIRT models

https://doi.org/10.1007/s10994-021-06005-7

Proposes an alternative network architecture and a regularization term in the loss function, which together allow for correlated latent skills.

VAE can be fit to a more general multivariate Gaussian prior distribution with a non-identity correlation matrix.
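To see what a non-identity prior changes in the regularization term, the sketch below computes the standard closed-form KL divergence between two multivariate Gaussians in base R. This is the textbook formula, not necessarily the exact parameterization used in the paper; `kl_gaussian` and all numeric values are illustrative.

```r
# KL( N(mu0, Sigma0) || N(mu1, Sigma1) ) for k-dimensional Gaussians:
# 0.5 * [ tr(S1^-1 S0) + (mu1 - mu0)' S1^-1 (mu1 - mu0) - k + log(det S1 / det S0) ]
kl_gaussian <- function(mu0, Sigma0, mu1, Sigma1) {
  k <- length(mu0)
  Sigma1_inv <- solve(Sigma1)
  d <- mu1 - mu0
  0.5 * (sum(diag(Sigma1_inv %*% Sigma0)) +
         as.numeric(t(d) %*% Sigma1_inv %*% d) -
         k +
         log(det(Sigma1) / det(Sigma0)))
}

# Hypothetical encoder output for one student (diagonal covariance).
mu <- c(0.4, -0.3, 0.1)
log_var <- c(-0.2, 0.1, -0.5)
Sigma_q <- diag(exp(log_var))

# Independent prior N(0, I) versus a hypothetical correlated prior N(0, Sigma).
Sigma_prior <- matrix(c(1.0, 0.3, 0.5,
                        0.3, 1.0, 0.2,
                        0.5, 0.2, 1.0), nrow = 3, byrow = TRUE)
kl_independent <- kl_gaussian(mu, Sigma_q, rep(0, 3), diag(3))
kl_correlated  <- kl_gaussian(mu, Sigma_q, rep(0, 3), Sigma_prior)
```

With the identity prior the expression reduces to the familiar VAE term \(\frac{1}{2}\sum(\sigma^2 + \mu^2 - 1 - \log\sigma^2)\); the correlated prior penalizes latent estimates that disagree with the assumed correlations.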

Runs experiments on four different datasets, both real and simulated, which are much larger than those used in Curi et al. (2019), demonstrating the ability of the proposed method to handle high-dimensional data.
compare results with those of traditional IRT parameter estimation methods, displaying the considerable

Uses a VAE to estimate item parameters and correlated latent abilities, and directly compares the ML2P-VAE method to more traditional parameter estimation methods, e.g., Monte Carlo expectation-maximization.

Other thoughts/comments

This was a very basic example, and results can be improved by fine-tuning the network architecture.
Though real-world datasets likely won't have “true” values of the parameters available, the simulated dataset includes them for reference.

Future directions

In real applications, it is unlikely that the exact correlations between latent abilities are available, so an approximate covariance matrix would need to be used instead.
Use other distributions in the VAE model, e.g., exponential families.
The experiments of Converse et al. (2021) imply that convergence likely relies on knowledge of an accurate covariance matrix among latent traits, so it is important to develop better methods of estimating the covariance matrix.

In the future, compare ML2P-VAE methods with sampling-based parameter estimation approaches, e.g., Gibbs sampling, Hamiltonian Monte Carlo, Metropolis-Hastings, and the Robbins-Monro algorithm.

Extend ML2P-VAE method to ML3P-VAE model.

VAE for Educational Assessment

LDI leaders concretae task talk

Overview

which face is fake?
