Q-matrix and variational autoencoders to estimate multidimensional item response theory models with correlated and independent latent variables
The Western Tennessee Chapter of the American Statistical Association: Annual Fall Chapter Symposium
Mahbubul Hasan (presenter); Faculty Mentors & Team Members: Dale Bowman, Lih Y. Deng, Ching-Chi Yang, John Sabatini, and John Hollander.
11/04/2021
Plan for this lecture
- Research Questions
- Data description
- Method: Q-matrix misspecification
- Generative model, why generative model?
- Autoencoder, Variational Autoencoder (VAE)
- Q-matrix & how to use it in VAE model?
- Multidimensional Logistic 2-Parameter (ML2P) Model
- VAE/deep learning frontiers in educational assessment
- Future directions or extensions
Which face is fake?
[Image: MIT 6.S191]
Research Questions
- How can a variational autoencoder be used to estimate MIRT models with large numbers of correlated and independent latent traits? What do the error measures look like as the number of latent traits and the number of students change?
- What are the effects of factors such as sample size, percentage of misfit items in the test, and item quality (item discrimination) on item and model fit when the Q-matrix is misspecified?
Data
- Two small sample sizes:
- 6 traits, 35 items, and 10,000 students
- 6 traits, 40 items, and 18,000 students
- Two large sample sizes:
- 6 traits, 50 items, and 25,000 students
- 20 traits, 200 items, and 60,000 students
- One real data set:
- Reading Inventory and Scholastic Evaluation (RISE): 3 traits, 32 items, and 4,100 students
Data
- ML2Pvae (Variational Autoencoder Models for IRT Parameter Estimation) R package (simulated data)
- Abilities for the \(N\) students, found in the data frame theta_true, were sampled from \(N(0, \Sigma)\).
- \(\Sigma\) specifies the correlations between the abilities and is found in the data frame correlation_matrix.
- Discrimination parameters were sampled uniformly from \([0.25, 1.75]\) and difficulty parameters uniformly from \([-3, 3]\); Q-matrix entries were sampled from \(Bern(0.2)\).
- Probabilities for each student answering each question correctly were calculated with the ML2P model (McKinley and Reckase, 1980).
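A minimal sketch of loading these bundled data sets; the object names here are assumed from the ML2Pvae package documentation:

```r
# A minimal sketch, assuming the simulated data bundled with ML2Pvae
library(ML2Pvae)
data("response_data")      # binary responses (students x items)
data("q_matrix")           # Q-matrix relating latent traits to items
data("correlation_matrix") # Sigma: correlations among the abilities
data("theta_true")         # true abilities, sampled from N(0, Sigma)
dim(response_data)         # N students by number of items
```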
Method
- Q-matrix misspecification (Table 1: correctly specified Q-matrix, misspecified Q-matrix, and misfit items). For instance, with 200 items, 20% misfit = 40 items and 40% misfit = 80 items (the analogous example in Sunbul et al., 2018, uses 15 items).

Research Model
(Research model: 1. low- and high-dimensional data from the population; 2. VAE architecture; 3. synthesize data; 4. incorporate Q-matrix; 5. model parameter estimation)
DataFrame (head)
- data(response)

- head(q-matrix)

Unsupervised learning
- Data: \(Y\) is the data; no labels exist.
- Objective: explore and learn the hidden, underlying structure of the data
- Example: dimensionality reduction, clustering
Generative model
- Generative means that once we learn the underlying structure of the data, we can create new examples that do not exist in the training set.
- Goal: take as input training samples from some distribution and learn a model that represents that distribution.
- The generative modeling process
[O'Reilly, 2021]
Why generative model?
- Facial classification, debiasing
- Able to uncover underlying latent variables in the data set

- Outlier detection
- Detect outliers in the distribution and use them during training to improve the model even more

[Amini, 2019]
What is latent variable?
- Plato's Myth of the Cave

- Shadows are the observed variables; latent variables are the underlying objects generating the shadows.
- The latent variables in this case (response data) are the knowledge components needed to solve the problems (i.e., the questions).
What is autoencoder?
- Autoencoders are a special class of neural networks in which the output layer is trained to reconstruct the input layer.
- An AE consists of two neural networks: an encoder and a decoder. The encoder takes the high-dimensional input \(x\) and feeds it forward through one or more hidden layers to some latent space.
- The decoder takes this latent representation and feeds it forward through hidden layers to output a reconstruction of the original input, \(\hat{x}\).
- AE model

[Rocca, 2019]
Why do we care about the low-dimensional \(z\)?
- Lower-dimensional feature representation learned from unlabeled data
- Reduces noise; keeps the most meaningful features of the distribution
- Train the model to use these features to reconstruct the original data
- Bottleneck hidden layers
What is VAE?
- A VAE is a probabilistic twist on the autoencoder.
- A VAE still consists of an encoder and a decoder, but the latent space returned from the encoder is trained to learn a probability distribution for \(\Theta\) given \(X\).
- The VAE allows the network to map the training data to a (latent) normal distribution.
- Sample from this distribution, and feed that sample forward through the decoder.
- VAE model (MNIST data)

VAE optimization
- Training VAE model
[Doersch, 2016]
- Left is without the “reparameterization trick”, and right is with it.
- Red shows sampling operations that are non-differentiable. Blue shows loss layers.
- The feed forward behavior of these networks is identical, but backpropagation can be applied only to the right network.
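For reference, the quantity being minimized in this training setup is the standard negative evidence lower bound (as in the cited tutorial, Doersch, 2016): the reconstruction error plus a KL penalty that keeps the encoded distribution close to the prior,
\[
\mathcal{L}(x) = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big).
\]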
Reparametrizing the sampling layer
- Reparametrizing
[MIT 6.S191]
- Enables end-to-end training of the VAE by backpropagation with respect to \(z\) and the actual weights of the encoder.
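A minimal base-R sketch of the trick (function and variable names are illustrative, not from the ML2Pvae package):

```r
# Reparameterization trick: instead of sampling z ~ N(mu, sigma^2) directly
# (a non-differentiable node), sample eps ~ N(0, 1) and compute z
# deterministically, so gradients flow back to mu and log_var.
reparameterize <- function(mu, log_var) {
  eps <- rnorm(length(mu))       # noise independent of the network weights
  mu + exp(0.5 * log_var) * eps  # z = mu + sigma * eps
}
z <- reparameterize(mu = c(0.2, -1.0), log_var = c(0.0, 0.5))
```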
VAE: latent perturbation during training
- Perturbation: a small change in a system, for instance as a result of a third object interacting with the system
- Slowly increase or decrease one latent variable while keeping the other variables fixed
- Disentanglement

VAE summary
- Compress information to a representation we can use to learn
- Reconstruction allows for unsupervised learning (only the data, no labels \(Y\))
- Reparametrization trick to train end to end
- Use perturbation to interpret hidden layers
- Generate new data/examples
What is Q-matrix and What can we do with Q-matrix?
- Shows the relationship between test items and latent or underlying attributes (concepts).
- The Q-matrix is an \(M \times N\) matrix, where \(M\) equals the number of questions in an assessment and \(N\) equals the total number of concepts required for understanding all questions.

- The Q-matrix can be used for understanding students' performance:
- by assigning the closest ideal response pattern to a student's response vector, we can infer which concepts the student does, and which s/he does not, know.
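As a toy illustration (values invented for this sketch, not from the study), a Q-matrix for 4 items and 3 concepts could be built in R as:

```r
# Toy Q-matrix: rows are items, columns are concepts; a 1 means the item
# requires that concept (illustrative values only)
# (note: the ML2Pvae package expects the transpose, num_skills x num_items)
Q <- matrix(c(1, 0, 0,
              1, 1, 0,
              0, 1, 0,
              0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("item", 1:4), paste0("skill", 1:3)))
Q
```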
Q-matrix & how to use it in VAE model?
- The AE and VAE architectures are combined with the ML2P model in the following ways:
- no hidden layers in the decoder; a sigmoidal activation function on the output-layer nodes (with non-negative weights);
- the Q-matrix determines the connections between the latent traits and the output items.
- VAE model
[Converse et al.,2021]
What are the difficulty and discrimination parameters?
- Reflect students' varied levels of understanding
- Difficulty

- Difficulty parameters can be any real number, but usually lie within \([-3, 3]\)
Discrimination parameters
- To be fair, students with better understanding should do better on the test
- Discrimination

- Discrimination parameters are always positive, usually within \([0.25, 3]\)
Multidimensional item response theory (MIRT)
- A naive approach to quantifying student knowledge is to look at the percentage of questions that the student answered correctly.
- This does not take into account the fact that each item on the assessment is different, both in difficulty and in content.
- For example, if one student answers only questions 1 and 4 incorrectly, and another student answers only questions 3 and 7 incorrectly, they have the same percentage score. But it is not likely that the two students share the same latent trait values: questions 3 and 7 may have tested a different skill than items 1 and 4, and could vary greatly in difficulty level.
- This is where MIRT models are useful; they play a major role in educational measurement.
Multidimensional Logistic 2-Parameter (ML2P) model
- The ML2P model gives the probability of a student answering a particular question correctly as a continuous function of student ability (McKinley and Reckase, 1980).
- For student \(j\) with latent abilities \(\Theta_j = (\theta_{j1}, \ldots, \theta_{jK})^\top\), the probability of answering item \(i\) correctly is
\(P(u_{ij} = 1 \mid \Theta_j; a_i, b_i) = \frac{1}{1 + \exp\big[-\sum_{k=1}^{K} a_{ik}\theta_{jk} + b_i\big]}\)
- difficulty parameter \(b_i\) for item \(i\)
- discrimination parameter \(a_{ik} \ge 0\) for each latent trait \(k\), quantifying the level of ability \(k\) required to answer item \(i\) correctly
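As a sketch, this formula transcribes directly into base R (function and variable names are mine, not from the package):

```r
# ML2P probability that a student answers one item correctly
# theta: length-K ability vector; a: length-K discrimination vector (a >= 0)
# b: scalar difficulty for the item
ml2p_prob <- function(theta, a, b) {
  1 / (1 + exp(-sum(a * theta) + b))
}
ml2p_prob(theta = c(0.5, -0.2), a = c(1.2, 0.8), b = 0.3)
```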
ML2P-VAE model
- No hidden layers in the decoder; instead, the non-zero weights in the decoder are determined by a given Q-matrix.
- Next, connect the encoded distribution layer directly to the output layer. The output layer must use the sigmoidal activation function
\(Q(z_i) = \frac{1}{1 + e^{-z_i}}\)
- VAE model
[Converse et al., 2021]
- We could try inputting non-binary responses (partial credit) into the ML2P-VAE model, but it would not have as much theory backing it; Samejima's graded response model would be a better fit.
Model parameters
- num_items and num_skills describe the assessment length and the number of abilities being evaluated by the assessment.
- Q_matrix is a num_skills by num_items matrix which specifies the relationship between items and abilities.
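A hedged sketch of passing these parameters to the ML2Pvae builder for independent traits (the return structure is assumed from the package documentation; check ?build_vae_independent):

```r
# Build encoder/decoder/VAE for independent latent traits (sketch)
library(ML2Pvae)
data("q_matrix")                # assumed name; num_skills x num_items
models <- build_vae_independent(
  num_items  = ncol(q_matrix),  # assessment length
  num_skills = nrow(q_matrix),  # number of abilities evaluated
  Q_matrix   = q_matrix
)
# models is assumed to hold the encoder, decoder, and full VAE
encoder <- models[[1]]; decoder <- models[[2]]; vae <- models[[3]]
```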
Get parameter estimates for the model after training
- get_item_parameter_estimates() takes all trainable parameters of the decoder part of the VAE and returns the values, which serve as estimates of the item parameters.
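Putting this together, a hedged end-to-end sketch (function names from the ML2Pvae package; the signature of get_ability_parameter_estimates() and the training settings are assumptions, not the study's configuration):

```r
# Train the VAE on the binary response data, then extract item parameters
data("response_data")
train_model(vae, train_data = as.matrix(response_data), num_epochs = 10)
item_est <- get_item_parameter_estimates(decoder)  # decoder weights -> a, b
# ability estimates come from the encoder side (assumed signature)
theta_est <- get_ability_parameter_estimates(encoder,
                                             as.matrix(response_data))
```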
Load in true values (included in this package)
data("disc_true"); data("diff_true"); data("theta_true")
disc_true <- as.matrix(disc_true)
diff_true <- as.matrix(diff_true)
theta_true <- as.matrix(theta_true)
Assumes latent traits are independent
(Small sample: ML2P-VAE parameter estimates for the data set with 35 items, 6 independent latent traits, and 10,000 students. Each color corresponds to the discrimination parameters for one of the 6 latent traits; the difficulty parameters are plotted separately for all 35 items.)
Assumes latent traits are independent (continued)
(Small sample: ML2P-VAE parameter estimates for the data set with 40 items, 6 independent latent traits, and 18,000 students. Each color corresponds to the discrimination parameters for one of the 6 latent traits; the difficulty parameters are plotted separately for all 40 items.)
Assumes latent traits are independent (continued)
(Large sample: ML2P-VAE parameter estimates for the data set with 50 items, 6 independent latent traits, and 25,000 students. Each color corresponds to the discrimination parameters for one of the 6 latent traits; the difficulty parameters are plotted separately for all 50 items.)
Assumes latent traits are independent (continued)
(Large sample: ML2P-VAE parameter estimates for the data set with 200 items, 20 independent latent traits, and 60,000 students. Each color corresponds to the discrimination parameters for one of the 20 latent traits; the difficulty parameters are plotted separately for all 200 items.)
Q-matrix Misspecification/Validation
(Small sample: Q-matrix misspecification plots comparing the correctly specified Q-matrix against under-, over-, and mixed-specification methods, for both 20 percent and 40 percent changes of items. The figure shows how the data points fall below the 45-degree line once the different misspecification methods are applied; the RMSE and correlation scores also confirm these changes.)
Q-matrix Misspecification/Validation (continued)
(Large sample: Q-matrix misspecification plots comparing the correctly specified Q-matrix against under-, over-, and mixed-specification methods, for both 20 percent and 40 percent changes of items. The figure shows how the data points fall below the 45-degree line once the different misspecification methods are applied; the RMSE and correlation scores also confirm these changes.)
Q-matrix Misspecification/Validation (continued)
(Error measures for ability (theta) parameters from various parameter estimation methods on two different data sets. The table shows how the RMSE, bias, and correlation scores change as the Q-matrix is misfit; the other methods are left blank intentionally.)
Correlated latent traits model
- Most effective when parameters specify a more complicated distribution, such as \(N(\mu, \Sigma)\) or \(N(0, \Sigma)\)
- Uses the TensorFlow Probability library
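Continuing the earlier sketch, a hedged call to the correlated-traits builder (argument names assumed from the ML2Pvae documentation; see ?build_vae_correlated; requires the TensorFlow Probability backend):

```r
# Build a VAE whose latent layer learns N(0, Sigma) with correlated traits
data("correlation_matrix")  # Sigma for the simulated abilities
models_corr <- build_vae_correlated(
  num_items  = ncol(q_matrix),
  num_skills = nrow(q_matrix),
  Q_matrix   = q_matrix,
  covariance_matrix = as.matrix(correlation_matrix)
)
```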
VAE for educational assessment (summary)
- ML2P-VAE methods are most useful on high-dimensional data
- Faster runtime
- Preserves data privacy
- Estimates of students' latent abilities are often more accurate when using ML2P-VAE methods.
- This is especially interesting, as the estimates of \(\Theta_j\) are not updated in the iterations of a gradient descent algorithm, while the estimates of \(a_{ik}\) and \(b_i\) are.
- Highly interpretable model.
- During estimation of difficulty parameters, the improvement gained from using traditional methods is incredibly small. [Converse et al., 2021]
Future directions
- Data anonymization via VAE to protect learners' privacy.
- Use of other distributions in the VAE model, e.g., exponential families.
- Compare ML2P-VAE methods with sampling-based parameter estimation approaches, e.g., Gibbs sampling, Hamiltonian Monte Carlo, Metropolis-Hastings, and the Robbins-Monro algorithm.
- Extend the ML2P-VAE method to an ML3P-VAE model.