Ordinal model of text quality

Jens Roeser

Compiled Oct 05 2022

1 Modelling text quality as a function of writing process measures

Text quality is assessed with a set of rubrics, each scored on an ordinal scale ranging from 1 (low quality) to 5 (high quality). Because there are several rubrics, the final model needs to be multivariate; for simplicity, we focus here on a single rubric, i.e. one univariate distribution of text quality ratings.

Text quality data will be modelled using a cumulative model with a normally distributed latent variable, i.e. a cumulative probit model (see Bürkner & Vuorre, 2019). In cumulative models of ordinal data we assume that the observed ordinal variable \(Y\), the text quality rating, originates from the categorization of a latent (not observable) continuous variable \(\tilde{Y}_\text{i}\), i.e. the latent quality of the text in a given rubric, of participant \(i \in \{1,\dots, I\}\), where \(I\) is the total number of participants (i.e. writers).

To model the categorization process, the model assumes that there are \(K\) thresholds \(\tau_\text{k}\) which separate \(\tilde{Y}\) into \(K+1\) observable, ordered categories of \(Y\). If we have 5 rating levels, there are 4 thresholds. These thresholds \(\tau_\text{k}\) replace the intercept. Thus, we model the probabilities of \(Y\) being equal to category \(k\) given a linear predictor \(\eta\). We assume the latent variable \(\tilde{Y}\) to be normally distributed. We call the corresponding cumulative normal distribution function \(\Phi\). Then, the probability for each text quality rating \(Y\) being equal to category \(k\) can be calculated as in equation (1.1), with the conventions \(\tau_0 = -\infty\) and \(\tau_{K+1} = \infty\) for the lowest and highest categories.

\[ \begin{equation}\tag{1.1} \text{P}(Y=k\mid\eta) = \Phi(\tau_\text{k}-\eta) - \Phi(\tau_\text{k-1}-\eta). \end{equation} \]

The linear regression for the text’s latent quality \(\tilde{Y}\) can be formulated as in equation (1.2).

\[ \begin{equation}\tag{1.2} \tilde{Y}_\text{i} = \eta_\text{i} + \epsilon \end{equation} \]

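\(\tilde{Y}\) thus consists of two parts: \(\eta\) represents the variation explained by the predictors, and \(\epsilon\) represents the unexplained variation. Consistent with the use of \(\Phi\) in equation (1.1), \(\epsilon\) is assumed to follow a standard normal distribution.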
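Equations (1.1) and (1.2) can be checked against each other numerically. The following is a minimal sketch in R, assuming hypothetical values for the thresholds `tau` and the linear predictor `eta` (neither is taken from a fitted model): it computes the category probabilities analytically and compares them with the proportions obtained by simulating the latent variable and cutting it at the thresholds.

```r
# Hypothetical values chosen only for illustration.
tau <- c(-1.5, -0.5, 0.5, 1.5)  # K = 4 thresholds -> 5 rating categories
eta <- 0.3                      # linear predictor for one text

# Equation (1.1): P(Y = k | eta) = Phi(tau_k - eta) - Phi(tau_{k-1} - eta),
# with tau_0 = -Inf and tau_5 = Inf.
cum_prob <- pnorm(c(-Inf, tau, Inf) - eta)
p_analytic <- diff(cum_prob)

# Equation (1.2): latent quality = eta + standard normal error,
# then cut at the thresholds to obtain the observed rating.
set.seed(1)
y_latent <- eta + rnorm(1e5)
y_obs <- findInterval(y_latent, tau) + 1
p_simulated <- as.numeric(table(factor(y_obs, levels = 1:5))) / length(y_obs)

round(rbind(p_analytic, p_simulated), 3)
```

The two rows should agree up to simulation error, which illustrates that the thresholds do the work an intercept would do in an ordinary regression.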

Our linear model \(\eta\) for predicting text quality ratings is defined as in equation (1.3).

\[ \begin{equation}\tag{1.3} \eta_\text{i} = \sum_\text{p=1}^\text{P} \beta_\text{p} \cdot \text{process}_\text{pi} + u_\text{i} \end{equation} \]

\(P\) is the total number of writing process measures \(p \in \{1,\dots,P\}\), such as the time to writing start, pause frequency, revisions, etc., that are included in the model to predict text quality ratings. \(\beta_\text{p}\) captures the relationship between the \(p^{\text{th}}\) writing process measure and text quality. \(u_\text{i}\) captures the variation in text quality ratings across participants; this variation is assumed to be normally distributed around a mean of 0 with a variance of \(\sigma_{u}^2\) (i.e. a standard deviation of \(\sigma_{u}\)) as shown in equation (1.4).

\[ \begin{equation}\tag{1.4} u_\text{i} \sim \mathcal{N}(0, \sigma_{u}^2) \end{equation} \]
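As a minimal sketch of equations (1.3) and (1.4) in R, the linear predictor can be assembled from a matrix of process measures, a vector of slopes, and by-participant deviations. All numbers below, as well as the column names `pause_freq` and `revisions`, are hypothetical and serve only to make the indices \(i\) and \(p\) concrete.

```r
set.seed(2)
n_ppt <- 50           # I: number of writers (hypothetical)
beta <- c(0.5, -0.3)  # P = 2 slopes (hypothetical)
sigma_u <- 0.4        # standard deviation of the by-participant deviations

# I x P matrix of standardised process measures (simulated stand-ins).
process <- matrix(rnorm(n_ppt * length(beta)), nrow = n_ppt,
                  dimnames = list(NULL, c("pause_freq", "revisions")))

# Equation (1.4): note that rnorm() takes the standard deviation, not the variance.
u <- rnorm(n_ppt, mean = 0, sd = sigma_u)

# Equation (1.3): sum over p of beta_p * process_pi, plus u_i.
eta <- as.vector(process %*% beta) + u

head(eta)
```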

2 Predicting text quality on the basis of incrementally obtained writing process estimates

The posterior parameter estimates of the model described previously will then be used to calculate the probability of text quality scores on the basis of the incrementally updated posterior distributions of all writing process measures.
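One way to carry posterior uncertainty through to the predicted ratings is to compute the category probabilities separately for every posterior draw and only then average over draws. The sketch below assumes the cumulative normal link from Section 1 and uses simulated stand-ins for the posterior draws; `predict_rating_probs`, `beta_draws`, `tau_draws` and `x_new` are hypothetical names and values, not output of a fitted model.

```r
set.seed(3)
n_draws <- 1000

# Stand-ins for posterior draws of slopes and thresholds (hypothetical values).
beta_draws <- cbind(rnorm(n_draws, 0.5, 0.1), rnorm(n_draws, -0.3, 0.1))
tau_draws <- cbind(rnorm(n_draws, -1.5, 0.1), rnorm(n_draws, -0.5, 0.1),
                   rnorm(n_draws, 0.5, 0.1), rnorm(n_draws, 1.5, 0.1))
tau_draws <- t(apply(tau_draws, 1, sort))  # keep thresholds ordered within a draw

# Current process estimates for one writer (hypothetical).
x_new <- c(0.8, -0.2)

predict_rating_probs <- function(x, beta_draws, tau_draws) {
  eta <- as.vector(beta_draws %*% x)               # one linear predictor per draw
  cum <- pnorm(cbind(-Inf, tau_draws, Inf) - eta)  # draws x 6 cumulative probabilities
  probs <- cum[, -1] - cum[, -ncol(cum)]           # draws x 5 category probabilities
  colMeans(probs)                                  # average over posterior draws
}

rating_probs <- predict_rating_probs(x_new, beta_draws, tau_draws)
round(rating_probs, 3)
```

As new process data arrive, `x_new` would be replaced by the updated estimates and the function called again.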

# Simulation of an ordinal outcome variable with 5 response categories
# corresponding to text quality.
# Returns the probability of each response category, i.e. how likely the
# text quality rating is to be 1, 2, 3, 4, or 5.

# Posterior from a population-level mixed-effects model. This
# needs to be an ordinal model with text quality as the outcome variable
# and process measures as predictors (random effects for participants).

# The text quality model probably needs to be multivariate for different rating criteria.
# Predictors need to be by-participant (text) MAPs from individual probability
# models of the process measures.

n_posterior <- 1000
b_low_sd <- rnorm(n = n_posterior, mean = 2, sd = 1) 
b_high_sd <- rnorm(n = n_posterior, mean = 2,  sd = 5)

# Thresholds
b01 <- rnorm(n = n_posterior, mean = 2, sd = .1)
b02 <- rnorm(n = n_posterior, mean = 0.05, sd = .1)
b03 <- rnorm(n = n_posterior, mean = -0.05, sd = .1)
b04 <- rnorm(n = n_posterior, mean = -2, sd = .1)

# Incoming writing process data (slope for text quality prediction)
# from incremental real time model
n <- 100
x_1 <- rnorm(n = n, mean = 0, sd = .1)
x_2 <- rnorm(n = n, mean = 0, sd = .1)

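# Outer products give n_posterior x n matrices (posterior draws x observations);
# each threshold draw is added row-wise and colMeans() averages over the draws.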
logodds1 <- colMeans(b01 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))
logodds2 <- colMeans(b02 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))
logodds3 <- colMeans(b03 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))
logodds4 <- colMeans(b04 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))

# plogis() is the inverse logit in base R (no extra package needed).
prob_2to5 <- plogis(logodds1)
prob_3to5 <- plogis(logodds2)
prob_4to5 <- plogis(logodds3)
prob_5 <- plogis(logodds4)

prob_1 <- 1 - prob_2to5
prob_2 <- prob_2to5 - prob_3to5
prob_3 <- prob_3to5 - prob_4to5
prob_4 <- prob_4to5 - prob_5

# Draw one rating per observation from its category probabilities.
y <- integer(n)
for (i in seq_len(n)) {
  y[i] <- sample(x = 1:5, size = 1,
    prob = c(prob_1[i], prob_2[i], prob_3[i], prob_4[i], prob_5[i])
  )
}

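library(tibble)   # tibble()
library(ordinal)  # clm()

simdat <- tibble(x_1, x_2, y)
# clm() models P(Y <= k), so its thresholds have the opposite sign of b01 to b04.
model <- clm(factor(y) ~ x_1 + x_2, data = simdat)
summary(model)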

formula: factor(y) ~ x_1 + x_2
data:    simdat

 link  threshold nobs logLik  AIC    niter max.grad cond.H 
 logit flexible  100  -133.95 279.89 6(2)  4.78e-08 2.4e+03

Coefficients:
    Estimate Std. Error z value Pr(>|z|)
x_1   0.4646     1.9464   0.239    0.811
x_2   0.9059     1.8857   0.480    0.631

Threshold coefficients:
    Estimate Std. Error z value
1|2  -1.7590     0.2850  -6.171
2|3   0.0588     0.2045   0.287
3|4   0.1396     0.2047   0.682
4|5   2.0766     0.3207   6.476
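The estimates in this output can be turned back into rating probabilities by hand. The sketch below copies the coefficients from the summary and assumes hypothetical predictor values `x` for a single text; it relies on the parameterization used by `clm()` with a logit link, in which the cumulative probability is P(Y <= k) = plogis(theta_k - eta).

```r
# Estimates copied from the summary output.
theta <- c(`1|2` = -1.7590, `2|3` = 0.0588, `3|4` = 0.1396, `4|5` = 2.0766)
b <- c(x_1 = 0.4646, x_2 = 0.9059)

# Hypothetical predictor values for one text.
x <- c(x_1 = 0.1, x_2 = -0.1)

# Linear predictor and cumulative probabilities P(Y <= k) for k = 1, ..., 5.
eta <- sum(b * x)
cum_prob <- c(plogis(theta - eta), 1)

# Category probabilities are differences of adjacent cumulative probabilities.
rating_probs <- diff(c(0, cum_prob))
names(rating_probs) <- 1:5
round(rating_probs, 3)
```

Because the `2|3` and `3|4` thresholds are very close together, category 3 receives only a small share of the probability mass, which mirrors the simulated thresholds of 0.05 and -0.05.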

References

Bürkner, P.-C., & Vuorre, M. (2019). Ordinal regression models in psychology: A tutorial. Advances in Methods and Practices in Psychological Science, 2(1), 77–101.