Ordinal model of text quality

Jens Roeser

Compiled Oct 06 2022

1 Modelling text quality as a function of writing process measures

The quality rating assigned to a particular text can be assumed to result from the processes that the writer engages in during the production of the text (e.g. planning, pausing, reading, revision) and from the moments in time at which these processes occur (planning and drafting come, ideally, before revision). For examples that link writing processes to text quality see López et al. (2019) and Breetvelt et al. (1994).

The quality of the text is assessed with a set of rubrics, each yielding ordinal ratings ranging from 1 (low quality) to 5 (high quality). As there are different rubrics, the model eventually needs to be multivariate, but for simplicity we will focus on one univariate distribution of text quality ratings, which is akin to one rubric.

We will model text quality using a cumulative normal distribution for ordinal data (see Bürkner & Vuorre, 2019). Cumulative models assume that the observed ordinal variable \(Y\), the text quality rating, originates from the categorization of a latent (not observable) continuous variable \(\tilde{Y}_\text{i}\), i.e. the latent quality of the text in a given rubric, of participant \(i\) for \(i \in \{1,\dots, I\}\) where \(I\) is the total number of participants (i.e. writers) for a particular text. We map the discrete observed data onto a continuous underlying scale.

To model the process of categorising texts (i.e. assigning a quality rating), the model assumes that there are \(K\) thresholds \(\tau_\text{k}\) which separate \(\tilde{Y}\) into \(K+1\) observable, ordered categories of \(Y\). If the rating variable has 5 levels, there are 4 thresholds. These thresholds \(\tau_\text{k}\) replace the intercept. Thus, we model the probabilities of \(Y\) being equal to category \(k\) given a linear predictor \(\eta\). For simplicity, we assume the latent variable \(\tilde{Y}\) to be normally distributed. We call the corresponding cumulative normal distribution function \(\Phi\). The probability of each text quality rating \(Y\) being equal to category \(k\) can be determined by equation (1.1), with \(\tau_0 = -\infty\) and \(\tau_{K+1} = \infty\) for the lowest and highest category.

\[ \begin{equation}\tag{1.1} \text{P}(Y=k\mid\eta) = \Phi(\tau_\text{k}-\eta) - \Phi(\tau_\text{k-1}-\eta). \end{equation} \]
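As a minimal sketch of equation (1.1), the category probabilities can be computed in R with `pnorm()`, padding the thresholds with \(-\infty\) and \(+\infty\) so that the lowest and highest category are covered. The threshold values and the values of `eta` are made up for illustration, and `category_probs` is a hypothetical helper, not a function from any package.

```r
# Hypothetical helper: probabilities of the K + 1 rating categories
# under a cumulative model with a normal latent variable (equation 1.1).
category_probs <- function(eta, tau) {
  # Pad the thresholds so that tau_0 = -Inf and tau_{K+1} = Inf
  tau_padded <- c(-Inf, tau, Inf)
  # P(Y = k) = Phi(tau_k - eta) - Phi(tau_{k-1} - eta)
  diff(pnorm(tau_padded - eta))
}

# Illustrative values (assumptions, not estimates from data)
tau <- c(-2, -0.5, 0.5, 2)  # 4 thresholds for 5 rating categories
probs_low  <- category_probs(eta = -1, tau = tau)
probs_high <- category_probs(eta =  1, tau = tau)

round(probs_low, 3)
round(probs_high, 3)
```

A larger value of `eta` shifts probability mass towards the higher rating categories, while the thresholds stay fixed.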

The linear regression for the text’s latent quality \(\tilde{Y}\) can be formulated as equation (1.2).

\[ \begin{equation}\tag{1.2} \tilde{Y}_\text{it} = \eta_\text{it} + \text{U}_\text{it} + \epsilon_\text{it} \end{equation} \]

\(\tilde{Y}\) consists then of three parts: \(\eta\) represents the variation explained by fixed effects, \(\text{U}\) are random effects, and \(\epsilon\) represents the unexplained variation.

Our linear model \(\eta\) for predicting text quality ratings is defined in equation (1.3).

\[ \begin{equation}\tag{1.3} \eta_\text{it} = \sum_\text{t = 1}^\text{T}\beta_\text{1t}\cdot \text{time}_\text{ij}^\text{t} + \sum_\text{p=1}^\text{P} \beta_\text{2p} \cdot \text{process}_\text{pij}+ \sum_\text{t=1}^\text{T} \sum_\text{p = 1}^\text{P} \beta_\text{3tp} \cdot \text{process}_\text{pij} \cdot \text{time}_\text{ij}^\text{t} \end{equation} \]

where \(\text{T}\) is the highest polynomial order, with \(t \in \{1,\dots,T\}\), used to capture the moment-by-moment variation in the explanatory variables, and \(\text{time}_\text{ij}\) is the \(j^\text{th}\) moment in the writing process of writer \(i\), i.e. the time elapsed since writing onset as a proportion of total writing time. Summing over \(\text{time}\) raised to the powers \(\text{t}\) allows for deviations from a linear relationship between time and text quality. The implemented model will use orthogonal polynomials, which are omitted here for simplicity. The coefficient \(\beta_\text{1t}\) captures the effect of the \(t^{\text{th}}\) order polynomial term of the predictor \(\text{time}\). \(P\) is the total number of writing process measures included in the model, with \(p \in \{1,\dots,P\}\), and \(\text{process}_\text{pij}\) is the value of the \(p^\text{th}\) measure for the \(i^\text{th}\) participant at the \(j^\text{th}\) moment in time. \(\beta_\text{2p}\) captures the relationship between the \(p^{\text{th}}\) writing process measure and text quality. The interaction of each process measure \(p\) and the \(t^{\text{th}}\) order polynomial of \(\text{time}\) is captured by \(\beta_\text{3tp}\).
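The linear predictor in equation (1.3) can be sketched in R for a single writer as follows. This is only an illustration under stated assumptions: the process measure, the polynomial order and all coefficient values are made up, and `poly()` is used to obtain orthogonal polynomials of time.

```r
# Illustrative data for one writer: J moments and one process measure (P = 1)
J       <- 50
time    <- seq(0, 1, length.out = J)  # proportion of total writing time
process <- sin(2 * pi * time)         # made-up process measure

# Orthogonal polynomials of time up to order T = 2 (J x T matrix)
T_order   <- 2
time_poly <- poly(time, degree = T_order)

# Made-up coefficients (assumptions, not estimates)
beta_1 <- c(0.5, -0.3)  # beta_1t: polynomial terms of time
beta_2 <- 0.8           # beta_2p: process measure
beta_3 <- c(0.2, 0.1)   # beta_3tp: process-by-time interactions

# Equation (1.3): linear predictor eta for every moment j
eta <- as.vector(
  time_poly %*% beta_1 +
    beta_2 * process +
    (time_poly * process) %*% beta_3  # row j of time_poly scaled by process[j]
)

head(eta)
```

With more than one process measure, the second and third terms would be summed over the columns of a process matrix in the same way.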

Equation (1.4) summarizes how by-participant variation is assumed to affect the text quality ratings.

\[ \begin{equation}\tag{1.4} \text{U}_\text{it} = u_\text{0i} + \sum_\text{t = 1}^\text{T}u_\text{1ti}\cdot \text{time}_\text{ij}^\text{t} + \sum_\text{p=1}^\text{P} u_\text{2pi} \cdot \text{process}_\text{pij} + \\ \sum_\text{t=1}^\text{T} \sum_\text{p = 1}^\text{P} u_\text{3tpi} \cdot \text{process}_\text{pij} \cdot \text{time}_\text{ij}^\text{t} \end{equation} \]

Different writers produce texts of different quality, some better and some worse than average. Also, every time a writer produces a text, the text’s quality will vary. We model this individual variation with random effects \(u_\text{m}\) (a random intercept \(u_\text{0}\) for the quality score and random slopes for the predictors), which are assumed to be distributed around 0 according to a multivariate normal distribution as shown in equation (1.5).

\[ \begin{equation}\tag{1.5} \begin{pmatrix} u_\text{0} \\ u_\text{1} \\ \vdots \\ u_\text{m} \\ \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} 0\\ 0\\ \vdots\\ 0 \end{pmatrix} , \Sigma_{u} \right) \end{equation} \]

where \(m \in \{0,\dots,M\}\) and \(M\) is the number of predictor variables that are allowed to vary across participants. As the individual’s average text quality score might affect the extent to which predictor variables influence text quality ratings, we need to add the variance-covariance matrix \(\Sigma_u\) defined in (1.6) (see Sorensen et al., 2016). The variance-covariance matrix captures the variability of the random effects across writers and the correlations between them, with the correlation denoted by \(\rho_u\). For example, pausing at certain points during the writing process might have a stronger effect on participants that generally produce weaker texts but only a subtle effect on participants that produce strong texts.

\[ \begin{equation}\tag{1.6} \Sigma_{u} = \begin{pmatrix} \sigma_{u0}^2 & \rho_u\sigma_{u0}\sigma_{u1} & \dots & \rho_u\sigma_{u0}\sigma_{um} \\ \rho_u\sigma_{u0}\sigma_{u1} & \sigma_{u1}^2 & \dots & \rho_u\sigma_{u1}\sigma_{um}\\ \vdots & \vdots & \ddots & \vdots\\ \rho_u\sigma_{u0}\sigma_{um} & \rho_u\sigma_{u1}\sigma_{um} & \dots & \sigma_{um}^2\\ \end{pmatrix} \end{equation} \]

The R package brms will be used to fit the cumulative multivariate mixed effects model (Bürkner, 2017, 2018; Bürkner & Vuorre, 2019). A Bayesian model provides access to the posterior probability distribution of every model parameter (Farrell & Lewandowsky, 2018), in a situation where a frequentist model is likely to show convergence issues (Bates et al., 2015).
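The structure of \(\Sigma_u\) in equation (1.6) and the multivariate normal draw in equation (1.5) can be sketched in base R. The standard deviations, the correlation and the number of writers below are made-up values for illustration, and a single \(\rho_u\) is used for every pair of random effects, following the notation of (1.6).

```r
# Made-up standard deviations for M + 1 = 3 random effects
# (u_0 intercept, u_1 and u_2 slopes) and a common correlation rho_u
sigma_u <- c(1.0, 0.5, 0.3)
rho_u   <- 0.4

# Correlation matrix with rho_u in every off-diagonal cell, as in (1.6)
R_u <- matrix(rho_u, nrow = 3, ncol = 3)
diag(R_u) <- 1

# Sigma_u = diag(sigma) %*% R %*% diag(sigma)
Sigma_u <- diag(sigma_u) %*% R_u %*% diag(sigma_u)

# Draw random effects for a number of writers: u_i ~ N(0, Sigma_u), as in (1.5)
set.seed(1)
n_writers <- 200
z <- matrix(rnorm(n_writers * 3), nrow = n_writers)  # independent standard normals
u <- z %*% chol(Sigma_u)                             # correlated random effects
colnames(u) <- c("u_0", "u_1", "u_2")

round(Sigma_u, 3)
round(cov(u), 3)  # sample covariance, close to Sigma_u for many writers
```

Multiplying independent standard normal draws by the Cholesky factor of \(\Sigma_u\) is one common way to obtain draws with the desired covariance without additional packages.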

2 Predicting text quality on the basis of incrementally obtained writing process estimates

The posterior parameter estimates of the model described previously will then be used to calculate the probability of text quality scores on the basis of the incrementally updated posterior distributions of all writing process measures.
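One way to turn posterior draws into predicted rating probabilities is to apply equation (1.1) to every draw and then average over draws. The sketch below assumes made-up posterior draws for four thresholds and one slope, and a single point estimate `x_new` standing in for the incrementally updated process measure of one writer; none of these values come from a fitted model.

```r
set.seed(123)
n_draws <- 1000

# Made-up posterior draws (assumptions standing in for a fitted model):
# four thresholds and one slope for a process measure
tau_draws <- cbind(
  rnorm(n_draws, mean = -2.0, sd = 0.1),
  rnorm(n_draws, mean = -0.5, sd = 0.1),
  rnorm(n_draws, mean =  0.5, sd = 0.1),
  rnorm(n_draws, mean =  2.0, sd = 0.1)
)
beta_draws <- rnorm(n_draws, mean = 1, sd = 0.3)

# Hypothetical current estimate of the process measure for one writer
x_new <- 0.8

# Linear predictor and cumulative probabilities P(Y <= k) for every draw
eta_draws <- beta_draws * x_new
cum_probs <- pnorm(tau_draws - eta_draws)  # n_draws x 4, row i uses eta_draws[i]

# Category probabilities per draw, then averaged over the posterior draws
probs_by_draw <- cbind(cum_probs, 1) - cbind(0, cum_probs)
pred_probs <- colMeans(probs_by_draw)
names(pred_probs) <- 1:5

round(pred_probs, 3)
```

Averaging the category probabilities over draws, rather than averaging the linear predictor first, carries the posterior uncertainty of the parameters into the predicted probabilities.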

# Simulation for ordinal outcome variable with 5 response categories
# corresponding to text quality.
# Returns the probability of each response category, i.e. whether the
# text quality criterion is more likely to be 1, 2, 3, 4, or 5.

# Posterior from population level mixed-effects model. This
# needs to be an ordinal model for text quality as outcome variable
# and process measures as predictors (random effects for participants).

# The text quality model probably needs to be multivariate for different rating criteria.
# Predictors need to be by-participant (text) MAPs from individual probability
# models of the process measures.

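# Simulated posterior draws of two slopes with the same mean but
# low vs. high posterior uncertainty
n_posterior <- 1000
b_low_sd <- rnorm(n = n_posterior, mean = 2, sd = 1)
b_high_sd <- rnorm(n = n_posterior, mean = 2, sd = 5)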

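# Thresholds, entered with reversed sign (b0k = -tau_k) so that
# P(Y > k) = inverse logit of (b0k + linear predictor)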
b01 <- rnorm(n = n_posterior, mean = 2, sd = .1)
b02 <- rnorm(n = n_posterior, mean = 0.05, sd = .1)
b03 <- rnorm(n = n_posterior, mean = -0.05, sd = .1)
b04 <- rnorm(n = n_posterior, mean = -2, sd = .1)

# Incoming writing process data (slope for text quality prediction)
# from incremental real time model
n <- 100
x_1 <- rnorm(n = n, mean = 0, sd = .1)
x_2 <- rnorm(n = n, mean = 0, sd = .1)

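# Open question: how do they need to be multiplied?
# Here: outer product of posterior draws and observations gives an
# n_posterior x n matrix; colMeans() averages over the posterior draws.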
logodds1 <- colMeans(b01 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))
logodds2 <- colMeans(b02 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))
logodds3 <- colMeans(b03 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))
logodds4 <- colMeans(b04 + b_low_sd %*% t(x_1) + b_high_sd %*% t(x_2))

# plogis() is the inverse logit in base R
prob_2to5 <- plogis(logodds1)
prob_3to5 <- plogis(logodds2)
prob_4to5 <- plogis(logodds3)
prob_5 <- plogis(logodds4)

prob_1 <- 1 - prob_2to5
prob_2 <- prob_2to5 - prob_3to5
prob_3 <- prob_3to5 - prob_4to5
prob_4 <- prob_4to5 - prob_5

# Sample one rating per observation from its category probabilities
y <- integer(n)
for (i in 1:n) {
  y[i] <- sample(x = c(1:5), size = 1,
    prob = c(prob_1[i], prob_2[i], prob_3[i], prob_4[i], prob_5[i])
  )
}

library(tibble)
library(ordinal)

# Fit a cumulative logit model to the simulated ratings
simdat <- tibble(x_1, x_2, y)
model <- clm(factor(y) ~ x_1 + x_2, data = simdat)
summary(model)

formula: factor(y) ~ x_1 + x_2
data:    simdat

 link  threshold nobs logLik  AIC    niter max.grad cond.H 
 logit flexible  100  -132.02 276.04 9(1)  5.68e-10 1.5e+03

Coefficients:
    Estimate Std. Error z value Pr(>|z|)
x_1   0.6967     1.9296   0.361    0.718
x_2   1.3985     2.0380   0.686    0.493

Threshold coefficients:
    Estimate Std. Error z value
1|2 -2.08282    0.32027  -6.503
2|3 -0.02223    0.20214  -0.110
3|4  0.13820    0.20247   0.683
4|5  2.33627    0.35123   6.652

References

Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious mixed models. arXiv Preprint arXiv:1506.04967.

Breetvelt, I., van den Bergh, H., & Rijlaarsdam, G. (1994). Relations between writing processes and text quality: When and how? Cognition and Instruction, 12(2), 103–123.

Bürkner, P.-C. (2017). Advanced Bayesian multilevel modeling with the R package brms. arXiv Preprint arXiv:1705.11123.

Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017

Bürkner, P.-C., & Vuorre, M. (2019). Ordinal regression models in psychology: A tutorial. Advances in Methods and Practices in Psychological Science, 2(1), 77–101.

Farrell, S., & Lewandowsky, S. (2018). Computational modeling of cognition and behavior. Cambridge University Press.

López, P., Torrance, M., & Fidalgo, R. (2019). The online management of writing processes and their contribution to text quality in upper-primary students. Psicothema: Revista de Psicología, 31(3), 311–318.

Sorensen, T., Hohenstein, S., & Vasishth, S. (2016). Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists. Quantitative Methods for Psychology, 12(3), 175–200.