1 Modelling text quality as a function of writing process measures
Text quality is assessed on a set of rubrics, each rated on an ordinal scale ranging from 1 (low quality) to 5 (high quality). Because there are several rubrics, the final model needs to be multivariate; for simplicity, we focus here on the univariate distribution of text quality ratings for a single rubric.
Text quality data will be modelled with a cumulative ordinal model assuming a normally distributed latent variable, i.e. a cumulative probit model (see Bürkner & Vuorre, 2019). In cumulative models of ordinal data we assume that the observed ordinal variable \(Y\), the text quality rating, originates from the categorization of a latent (unobservable) continuous variable \(\tilde{Y}_\text{i}\), the latent quality of the text on a given rubric written by participant \(i \in \{1,\dots, I\}\), where \(I\) is the total number of participants (i.e. writers).
To model the categorization process, the model assumes that there are \(K\) thresholds \(\tau_\text{k}\) which separate \(\tilde{Y}\) into \(K+1\) observable, ordered categories of \(Y\); with 5 rating levels, there are thus 4 thresholds. These thresholds \(\tau_\text{k}\) replace the intercept. We then model the probability of \(Y\) being equal to category \(k\) given a linear predictor \(\eta\). We assume the latent variable \(\tilde{Y}\) to be normally distributed and denote the corresponding cumulative normal distribution function by \(\Phi\). The probability of each text quality rating \(Y\) being equal to category \(k\) can then be calculated as in equation (1.1).
\[ \begin{equation}\tag{1.1} \text{P}(Y=k\mid\eta) = \Phi(\tau_\text{k}-\eta) - \Phi(\tau_\text{k-1}-\eta) \end{equation} \] where, by convention, \(\tau_\text{0} = -\infty\) and \(\tau_\text{K+1} = +\infty\), so that the probabilities of the lowest and highest categories are well defined. The linear regression for the text's latent quality \(\tilde{Y}\) can be formulated as in equation (1.2).
\[ \begin{equation}\tag{1.2} \tilde{Y}_\text{i} = \eta_\text{i} + \epsilon \end{equation} \]
\(\tilde{Y}\) thus consists of two parts: \(\eta\) represents the variation explained by the predictors, and \(\epsilon\) the unexplained (residual) variation.
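To make the categorization step concrete, the following minimal sketch computes the category probabilities of equation (1.1) in R with pnorm(); the threshold and predictor values are illustrative assumptions, not estimates from our data.

# Illustrative thresholds tau_1, ..., tau_4 for 5 rating levels (assumed values)
thresholds <- c(-2, -0.5, 0.5, 2)
eta <- 0.8                          # linear predictor for one text

# Pad with -Inf and +Inf so equation (1.1) also covers the lowest and highest category
tau <- c(-Inf, thresholds, Inf)
p_k <- pnorm(tau[-1] - eta) - pnorm(tau[-length(tau)] - eta)
round(p_k, 3)                       # probabilities of ratings 1 to 5
sum(p_k)                            # the five probabilities sum to 1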
Our linear model \(\eta\) for predicting text quality ratings is defined as in equation (1.3).
\[ \begin{equation}\tag{1.3} \eta_\text{i} = \sum_\text{p=1}^\text{P} \beta_\text{p} \cdot \text{process}_\text{pi} + u_\text{i} \end{equation} \]
\(P\) is the total number of writing process measures \(p \in \{1,\dots,P\}\) included in the model to predict text quality ratings, such as the time to start writing, pause frequency, number of revisions, etc. \(\beta_\text{p}\) captures the relationship between the \(p^{\text{th}}\) writing process measure and text quality. \(u_\text{i}\) captures the variation in text quality ratings across participants; this variation is assumed to be normally distributed around a mean of 0 with variance \(\sigma_{u}^2\), as shown in equation (1.4).
\[ \begin{equation}\tag{1.4} u_\text{i} \sim \mathcal{N}(0, \sigma_{u}^2) \end{equation} \]
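As a minimal sketch of equations (1.3) and (1.4), the code below simulates the linear predictor from a small set of writing process measures plus by-participant variation and converts it into rating probabilities via equation (1.1); the number of participants, the regression weights, and the variance component are assumed values for illustration only.

set.seed(1)
I <- 50                                     # number of participants (writers)
P <- 3                                      # number of writing process measures
process <- matrix(rnorm(I * P), nrow = I)   # standardised process measures
beta    <- c(0.6, -0.3, 0.2)                # assumed process-quality relations
sigma_u <- 0.5                              # SD of by-participant variation
u <- rnorm(I, mean = 0, sd = sigma_u)       # equation (1.4)

eta <- as.vector(process %*% beta) + u      # equation (1.3)

# Rating probabilities per participant via equation (1.1)
tau <- c(-Inf, -2, -0.5, 0.5, 2, Inf)
probs <- sapply(1:5, function(k) pnorm(tau[k + 1] - eta) - pnorm(tau[k] - eta))
head(round(probs, 3))                       # rows = participants, columns = ratings 1 to 5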
2 Predicting text quality on the basis of incrementally obtained writing process estimates
The posterior parameter estimates of the model described previously will then be used to calculate the probability of text quality scores on the basis of the incrementally updated posterior distributions of all writing process measures.
# Simulation for ordinal outcome variable with 5 response categories
# corresponding to text quality.
# Returns the probability of each response category, i.e. whether the text quality criterion is more likely to be rated 1, 2, 3, 4, or 5.
# Posterior from the population-level mixed-effects model. This
# needs to be an ordinal model with text quality as outcome variable
# and process measures as predictors (random effects for participants).
# The text quality model probably needs to be multivariate to handle the different rating criteria.
# Predictors need to be by-participant (by-text) MAP estimates from the
# individual probability models of the process measures.
n_posterior <- 1000

# Posterior draws for the two slope parameters: one estimated with a
# narrow (sd = 1) and one with a wide (sd = 5) posterior
b_low_sd  <- rnorm(n = n_posterior, mean = 2, sd = 1)
b_high_sd <- rnorm(n = n_posterior, mean = 2, sd = 5)
# Thresholds
b01 <- rnorm(n = n_posterior, mean = 2, sd = .1)
b02 <- rnorm(n = n_posterior, mean = 0.05, sd = .1)
b03 <- rnorm(n = n_posterior, mean = -0.05, sd = .1)
b04 <- rnorm(n = n_posterior, mean = -2, sd = .1)
# Incoming writing process estimates (predictor values for the text quality
# prediction) from the incremental real-time model
n <- 100
x_1 <- rnorm(n = n, mean = 0, sd = .1)
x_2 <- rnorm(n = n, mean = 0, sd = .1)
# Combine every posterior draw with every incoming predictor value:
# b_low_sd %*% t(x_1) gives an n_posterior x n matrix, and colMeans()
# averages over the posterior draws for each of the n observations.
logodds1 <- colMeans(b01 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))
logodds2 <- colMeans(b02 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))
logodds3 <- colMeans(b03 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))
logodds4 <- colMeans(b04 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))
# plogis() is base R's inverse logit function
inv_logit <- plogis

prob_2to5 <- inv_logit(logodds1)   # P(Y >= 2)
prob_3to5 <- inv_logit(logodds2)   # P(Y >= 3)
prob_4to5 <- inv_logit(logodds3)   # P(Y >= 4)
prob_5    <- inv_logit(logodds4)   # P(Y = 5)
prob_1 <- 1 - prob_2to5
prob_2 <- prob_2to5 - prob_3to5
prob_3 <- prob_3to5 - prob_4to5
prob_4 <- prob_4to5 - prob_5
y <- c()
for (i in 1:n) {
  y[i] <- sample(x = c(1:5), size = 1,
                 prob = c(prob_1[i], prob_2[i], prob_3[i], prob_4[i], prob_5[i]))
}
library(tibble)
library(ordinal)

simdat <- tibble(x_1, x_2, y)

# Check parameter recovery with a cumulative link (logit) model
model <- clm(factor(y) ~ x_1 + x_2, data = simdat)
summary(model)
formula: factor(y) ~ x_1 + x_2
data:    simdat

 link  threshold nobs logLik  AIC    niter max.grad cond.H
 logit flexible  100  -133.95 279.89 6(2)  4.78e-08 2.4e+03

Coefficients:
    Estimate Std. Error z value Pr(>|z|)
x_1   0.4646     1.9464   0.239    0.811
x_2   0.9059     1.8857   0.480    0.631

Threshold coefficients:
    Estimate Std. Error z value
1|2  -1.7590     0.2850  -6.171
2|3   0.0588     0.2045   0.287
3|4   0.1396     0.2047   0.682
4|5   2.0766     0.3207   6.476
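Note that clm() parameterises the cumulative probabilities as \(\tau_\text{k} - \eta\), so the estimated thresholds correspond to the negated threshold means (b01 to b04) used in the simulation. As a minimal sketch of the final prediction step, rating probabilities for new, e.g. incrementally updated, process estimates can be obtained from the fitted model with predict(); the newdata values below are illustrative placeholders, not real incremental estimates.

# Predicted probabilities for each rating category, one row per new observation
new_process <- tibble(x_1 = c(-0.1, 0, 0.1),
                      x_2 = c(-0.1, 0, 0.1))
predict(model, newdata = new_process, type = "prob")$fit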