The data set analysed in this report is somewhat atypical, as for each participant we have multiple repetitions of the same outcomes (i.e. text quality indicators) that was associated with time course writing processes. The idea is that we want to infer the predicted text quality when participants do or do not engage with a certain writing process activity at a particular point during the writing process.
Breetvelt, Van den Bergh, and Rijlaarsdam (1994) have applied a regression model to these data treating each text quality measure associated with time \(t_0\) independent from their neighbouring observations at \(t_{-1}\) and \(t{_1}\) etc. Text quality scores are of course not independent across time course but identical for each writing session.
In the following simulation we will evaluate this approach. We will use three approaches to uncover known population parameter values using linear regressions (1) for unique (non-repeated) observations, (2) for repeated observations similar to Breetvelt, Van den Bergh, and Rijlaarsdam (1994), and (3) using a bootstrapping approach in which single observations per participant were repeatedly sampled from the duplicated data to remove the independence voilation in (2).
The first chunk of code is simulating the data rendering a data frame with unique observations.
library(tidyverse)
library(broom)
# Known parameter values
intercept <- 50
b1 <- 0
b2 <- 0
b3 <- 25
sigmae <- 40
n_time <- 10
N <- 100
time <- rep(1:n_time, N)
is_reading <- sample(c(0,1), n_time * N, replace = T)
# Simulated data
y <- rnorm(n_time * N,
mean = intercept + b1 * time + b2 * is_reading + b3 * time * is_reading,
sd = sigmae)
sim_data <- tibble(y = round(y), time = time, is_reading = is_reading)
sim_data
## # A tibble: 1,000 × 3
## y time is_reading
## <dbl> <int> <dbl>
## 1 44 1 1
## 2 57 2 1
## 3 194 3 1
## 4 122 4 1
## 5 96 5 0
## 6 81 6 0
## 7 190 7 1
## 8 27 8 0
## 9 106 9 0
## 10 359 10 1
## # … with 990 more rows
A linear model was used to uncover the parameter values from the simulated data. The results are shown in Figure 1.1 along with the true parameter values.
model <- lm(y ~ time + is_reading + time:is_reading, data = sim_data)
model_simple <- tidy(model, conf.int = T) %>%
select(term, estimate, starts_with("conf"))
sigma <- augment(model) %>%
summarise(across(.sigma, list(estimate = mean,
conf.low = lower,
conf.high = upper),
.names = "{.fn}")) %>%
mutate(term = "sigma")
model_simple <- bind_rows(model_simple, sigma)
The following modification of the structure of the simulated data frame is, more or less, akin to the real data set with multiple replications of the same observation.
# sample rows randomly
sim_data_rep <-
dplyr::slice(sim_data, rep(1:n(), each = 1000)) %>%
group_by(time) %>%
mutate(reps = 1:n()) %>%
ungroup(); sim_data_rep
## # A tibble: 1,000,000 × 4
## y time is_reading reps
## <dbl> <int> <dbl> <int>
## 1 44 1 1 1
## 2 44 1 1 2
## 3 44 1 1 3
## 4 44 1 1 4
## 5 44 1 1 5
## 6 44 1 1 6
## 7 44 1 1 7
## 8 44 1 1 8
## 9 44 1 1 9
## 10 44 1 1 10
## # … with 999,990 more rows
Parameter values were uncovered in the same way as for the previous version of the data frame with unique observations. Again, the results are shown in Figure 1.1.
model <- lm(y ~ time + is_reading + time:is_reading, data = sim_data_rep)
model_reps <- tidy(model, conf.int = T) %>%
select(term, estimate, starts_with("conf"))
sigma <- augment(model) %>%
summarise(across(.sigma, list(estimate = mean,
conf.low = lower,
conf.high = upper),
.names = "{.fn}")) %>%
mutate(term = "sigma")
model_reps <- bind_rows(model_reps, sigma)
The last analysis is using a bootstrapping approach to ensure that only unique observations are used from the previous version of the data frame. The first code is creating a data frame with unique replication identifiers reps
, and is then assigning time points randomly to the replications, so that in each simulation there are as many replications of each time point (similar to number of participants) as in the initial data set. This process is repeated for N_sims
simulations.
# Create random permutation
# Must be 1 participant per sim with about equal time points
times <- unique(time)
reps <- pull(sim_data_rep, reps) %>% unique()
N_sims <- 10000
data_sims <- map_dfr(1:N_sims,
~tibble(sim = .x,
reps = sample(reps, n_time * N)) %>%
mutate(time = sample(times, size = n(), replace = T))) %>%
arrange(sim, time)
data_sims %>% arrange(sim, reps)
The data frame with random replication-by-time permutations is then combined with the respective observations from the simulated and replicated data so that observations are unique within permutation (sim
). The data frame with all simulated permutations of replication and time was then analysed with one linear regression per simulation. From those linear models we extracted the inferred value for each parameter and summarised those as means and 95% confidence intervals.
model_bs_all <- data_sims %>%
inner_join(sim_data_rep, by = c("time", "reps")) %>%
nest(-sim) %>%
mutate(model = map(data, ~lm(y ~ time * is_reading, data = .)),
tidied = map(model, tidy, conf.int = T),
augmented = map(model, augment))
sigma <- model_bs_all %>%
unnest(cols = augmented) %>%
select(.sigma) %>%
drop_na() %>%
summarise(across(.sigma, list(estimate = mean,
conf.low = lower,
conf.high = upper),
.names = "{.fn}")) %>%
mutate(term = "sigma")
model_bs <- model_bs_all %>%
unnest(cols = tidied) %>%
#dplyr::select(term, estimate) %>%
dplyr::select(sim, term, estimate, starts_with("conf.")) %>%
drop_na() %>%
group_by(term) %>%
summarise(across(estimate,
list(estimate = mean,
conf.low = lower,
conf.high = upper),
.names = "{.fn}" )) %>%
ungroup() %>%
bind_rows(sigma)
Figure 1.1 shows the three modelling approaches in comparison to the true parameter values. All three modelling approaches rendered an average parameter value estimate close to the true parameter value. However, the analysis of the duplicated observations underestimates the confidence bands of all parameters compared to the analysis of unique observations. The boostrapping approach mirrors the results of the analysis of unique observations but based on data with duplicated observations. In other words, the boostrapping approach is returning the expected results without violating independence and captures the true parameter value.
Figure 1.1: Comparison of uncovered parameter values for each approach and the true known parameter values.
Note aside that the results for the analysis with unique observations and the bootstrapping analysis are relatively similar. However, the unique-observations analysis is more likely to vary from simulation to simulation and will be more sensitive to variations due to unbalanced observations than the bootstrapping analysis.
# Load data
# and get data in right format
data <- read_csv("../data/by_event_data_with_quality_measures.csv") %>%
filter(location != "post-paragraph",
!is.na(location),
next_event_type != "block-operation",
event_type != "block-operation",
event_duration > 50) %>%
mutate(overall_rating = rater_zoe + rater_jamie,
across(c(totalfix_dur, event_duration), log),
across(is_sustained_reading, replace_na, FALSE),
across(starts_with("pseq_"), replace_na, 0),
is_sentence = location == "post-sentence",
is_word = location == "post-word",
across(overall_rating, as.ordered)) %>%
group_by(token) %>%
mutate(is_new_edge = edge_id != lag(edge_id),
across(is_new_edge, replace_na, FALSE)) %>%
ungroup() %>%
select(token, prompt, session_no, participant, time_onset,
time_norm, event_duration, is_edit,
is_sentence, is_word,
median_fixation_distance_words,
has_sustained_reading,
is_lookback, totalfix_dur, is_jump,
is_major_block, char_count:lexd,
oc_ratio, overall_rating, is_new_edge,
pseq_id, non_lookback_id) %>%
# Change Boolean to numeric
mutate(across(where(is.logical), as.numeric))
# Remove (mainly) within word data equally distributed over text.
# data <- data %>%
# mutate(time_norm_group = factor(round(time_norm,1))) %>%
# group_by(token, is_edit, is_sentence, is_word, time_norm_group, is_lookback) %>%
# mutate(idx = sample(n())) %>%
# ungroup() %>%
# filter(idx <= 5) %>%
# select(-time_norm_group, -idx) %>%
# drop_na()
# Might be necessary for Bayesian model
# Fixed effects formula
x_fes <- 'time_onset +
time_norm +
time_norm :
( is_lookback +
is_edit +
event_duration +
pseq_id +
is_major_block +
is_new_edge +
is_jump +
non_lookback_id +
is_lookback : totalfix_dur : (is_word + is_sentence) +
is_lookback : has_sustained_reading +
is_lookback : median_fixation_distance_words +
is_edit : (is_word + is_sentence) +
event_duration : (is_word + is_sentence + is_edit))'
ys <- c('overall_rating', 'char_count', 'word_count', 'sentence_count', 'M_word_len', 'M_sent_len', 'lexd', 'oc_ratio')
forms <- map(ys, str_c, x_fes, sep = " ~ ")
x_res <- '(1|participant)'
x_mix <- str_c(forms, x_res, sep = " + ")
x_mix <- map(x_mix, as.formula)
Empty list for mixed-effects models
mixeff_models <- list()
# Does not work
mixeff_models[[1]] <- clmm(x_mix[[1]], data = data)
# Does not work
mixeff_models[[2]] <- glmer(x_mix[[2]], data = data, family = poisson())
# Does not work
mixeff_models[[3]] <- glmer(x_mix[[3]], data = data, family = poisson())
# Does not work
mixeff_models[[4]] <- glmer(x_mix[4], data = data, family = poisson())
mixeff_models[[5]] <- lmer(x_mix[[5]], data = data)
mixeff_models[[6]] <- lmer(x_mix[[6]], data = data)
mixeff_models[[7]] <- lmer(x_mix[[7]], data = data)
mixeff_models[[8]] <- lmer(x_mix[[8]], data = data)
lmer
sAs the ordinal and count models above did not work, I’ll refit all models as continuous / Gaussian model after transforming and standardising the outcome variable (as appropriate).
# Transform and standardise outcome variables
data2 <- data %>%
mutate(across(overall_rating, as.numeric),
across(c(ends_with("count")), ~log(.+1)),
across(c(overall_rating, char_count, word_count,
sentence_count, M_word_len, M_sent_len,
lexd, oc_ratio), ~scale(.)[,1]))
Apply lmer
to all formulas with transformed data in data frame data2
.
mixeff_models <- map(x_mix, lmer, data = data2)
# Get name of outcome variable
get_response <- function(formula) {
tt <- terms(formula)
vars <- as.character(attr(tt, "variables"))[-1] ## [1] is the list call
response <- attr(tt, "response") # index of response var
vars[response]
}
# Extract slopes
mixeff_slopes <- map_dfr(mixeff_models, ~tidy(.x, conf.int = T) %>%
mutate(outcome = get_response(.x))) %>%
filter(effect == "fixed",
!str_detect(term, "Intercept")) %>%
select(outcome, term, where(is.numeric))
Some estimates / errors are very small. This is probably related to a model misspecification.
mixeff_slopes %>%
ggplot(aes(y = estimate,
ymin = conf.low,
ymax = conf.high,
x = term,
colour = outcome)) +
# facet_wrap(~outcome) +
geom_point(position = position_dodge(.5)) +
geom_errorbar(position = position_dodge(.5)) +
coord_flip() +
labs(y = "Estimate with 95% CIs",
x = "Predictor",
colour = "Text-quality\nmeasures") +
theme(legend.justification = "top")
Figure 5.1: Mixed effects model.
x_forms <- map(forms, as.formula)
# Empty list for models
models <- list()
models[[1]] <- clm(x_forms[[1]], data = data)
models[[2]] <- glm(x_forms[[2]], data = data, family = poisson())
models[[3]] <- glm(x_forms[[3]], data = data, family = poisson())
models[[4]] <- glm(x_forms[[4]], data = data, family = poisson())
models[[5]] <- lm(x_forms[[5]], data = data)
models[[6]] <- lm(x_forms[[6]], data = data)
models[[7]] <- lm(x_forms[[7]], data = data)
models[[8]] <- lm(x_forms[[8]], data = data)
Models without random effects run without problems.
lm
sFor comparability I rerun these models with transformed / scaled outcomes variables in lm
.
models <- map(x_forms, lm, data = data2)
# Extract slopes
m_slopes <- map_dfr(models, ~tidy(.x, conf.int = T) %>%
mutate(outcome = get_response(.x))) %>%
filter(!str_detect(term, "Intercept|\\|")) %>%
select(outcome, term, where(is.numeric))
m_slopes %>%
ggplot(aes(y = estimate,
ymin = conf.low,
ymax = conf.high,
x = term,
colour = outcome)) +
# facet_wrap(~outcome) +
geom_point(position = position_dodge(.5)) +
geom_errorbar(position = position_dodge(.5)) +
coord_flip() +
labs(y = "Estimate with 95% CIs",
x = "Predictor",
colour = "Text-quality\nmeasures") +
theme(legend.justification = "top")
Figure 6.1: Coefficients of model without random participant intercepts.
So far even simple Bayesian models akin to the frequentist models above failed. Sampler got stuck in small numeric areas which might be because outcome variables were repeated cross multiple rows.
The only models that showed alright convergence were gausians including the multivariate gausian with time as predictor
need to decide which predictors to include
# Sample multiple time points across tokens
# but only 1 per token
# N of token is max time of time points per lm
time_points <- as.character(seq(0, 1, by = .01))
tokens <- pull(data, token) %>% unique()
N_tokens <- length(tokens)
data <- mutate(data, time_norm_group = round(time_norm, 2),
across(time_norm_group, as.character))
# Create random permutation
# This code, atm, assumes that there are more time points than tokens
# this this isn't true, the tibble and mutate arguments needs to be swapped
N_sims <- 1000
data_sims <- map_dfr(1:N_sims,
~tibble(time_norm_group = sample(time_points,
size = N_tokens)) %>%
mutate(token = sample(tokens, size = n()),
sim = .x) %>%
arrange(time_norm_group))
inner_join(data, data_sims, by = c("time_norm_group", "token")) %>%
# count(sim, time_norm_group, token)
nest(-sim) %>%
mutate(model = map(data, ~lm(M_word_len ~ time_norm * is_lookback, data = .)),
tidied = map(model, tidy, conf.int = T)) %>%
unnest(tidied) %>%
# filter(!str_detect(term, "Intercept")) %>%
dplyr::select(term, estimate, starts_with("conf")) %>%
# ggplot(aes(x = estimate, colour = term)) +
# geom_density()
group_by(term) %>%
summarise(across(c(estimate, starts_with("conf")),
list(mean = mean),
.names = "{.col}")) %>%
ungroup()
## # A tibble: 4 × 4
## term estimate conf.low conf.high
## <chr> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.73 4.72 4.74
## 2 is_lookback 0.0123 -0.0968 0.121
## 3 time_norm 0.00714 -0.0153 0.0296
## 4 time_norm:is_lookback 0.0241 -0.161 0.209