knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
cache = FALSE,
warning = FALSE,
fig.width = 7,
fig.height = 4.5,
fig.align = "center"
)

# install.packages('pacman')
library(pacman)
p_load(
tidyverse,
tidytext,
knitr,
kableExtra,
ggraph,
igraph,
psych,
plotly,
DT,
networkD3,
jsonlite,
htmltools
)

Hello. This notebook is built for you, the student, to help you see what is happening when people say things like “we used an LLM to assist with thematic coding.” It is not a tool for actually doing your thesis analysis. We will come back to that in a moment, and again, and again, because it matters.
The whole point of this notebook is to take what is usually a black box — “a machine somehow turned interview text into codes” — and break it open into pieces small enough that you can read each one and say “ah, I see what this is doing.” Every weight, every embedding lookup, every probability, written in plain R. Nothing is hidden, and nothing is calling out to a server somewhere.
This version is fully interactive. Charts respond to hover and zoom. Tables can be sorted and searched. You can type your own text and watch the model predict codes for it in real time — all running inside your browser, no server required.
This is a teaching tool. Do not use this notebook as the analytic engine for any piece of qualitative research you intend to publish, submit for marking, or rely on. The model trained here is far too small to be reliable on real data. It will memorise the toy examples instead of learning anything generalisable. That is intentional — overfitting is what makes the mechanism visible. But it also means the model’s “codes” are not trustworthy in any methodological sense. If you take only one thing from this notebook, take that.
Before we touch any R, here is the whole pipeline as a story. If you can follow this story, you can follow the rest of the notebook.
1. Gather a handful of interview snippets (the data).
2. Write a codebook: a fixed list of code labels, each with a definition.
3. Hand-label every snippet with the codes that apply (the ground truth).
4. Turn each snippet into a sequence of integer word IDs, e.g. [12, 8, 3, 41] (tokenisation).
5. Look up a learned vector of numbers for each ID and average them into one vector per snippet (embedding and pooling).
6. Multiply that vector by a weight matrix and add a bias to get one raw score per code (the linear layer).
7. Squash each score through a sigmoid into a probability between 0 and 1; a code “applies” when its probability clears a threshold.
8. Compare the probabilities to the human labels, measure the error, and nudge every weight slightly downhill; repeat thousands of times (training).

Everything below is one of those steps, with the R code beside it. Click Show on any chunk to read the implementation; click Hide to step back to the prose.
So why bother? Because if you understand the eight steps above, you understand both what an LLM is doing under the hood and why you cannot trust it without checking. That is a more durable piece of knowledge than any specific analytic shortcut.
A code is a short descriptive label attached to a chunk of qualitative data. In Braun & Clarke’s RTA (2006, 2022), coding is phase two of six:

1. Familiarising yourself with the dataset
2. Coding
3. Generating initial themes
4. Developing and reviewing themes
5. Refining, defining, and naming themes
6. Writing up
Coding can be inductive (codes emerge from the data) or deductive (codes are applied from a pre-existing codebook). The model below is a deductive one — we hand it a fixed codebook and it learns to apply those exact labels. Inductive coding requires generative language ability, which we do not have at this scale.
A code is not a theme. A theme is a higher-order pattern of meaning constructed by a researcher across codes. Themes are not in this notebook. Themes are your job.
Nine short, fully synthetic interview snippets across three participants. Each snippet has been hand-labelled with one or more codes from the small codebook defined below.
interviews <- tibble::tribble(
~participant_id, ~snippet_id, ~text,
"P01", 1, "Honestly the marking is the worst part. I get home at six and there's another three hours of essays waiting. My partner has stopped asking when I'll come to bed.",
"P01", 2, "I love teaching though. When a student finally gets it, that's the bit that keeps me going. It's just everything around teaching that's exhausting.",
"P01", 3, "I've started saying no to things. Committee work, optional meetings. I felt guilty at first but it's the only way I survive.",
"P02", 4, "I think the expectation is that you're always available. Email at 9pm, parents wanting calls on weekends. There's no boundary anymore.",
"P02", 5, "What helps is my colleagues. We have a WhatsApp group and we just vent. I don't think I'd cope without them honestly.",
"P02", 6, "The admin load has tripled in five years. I used to teach. Now I do compliance and call it teaching.",
"P03", 7, "I had to take stress leave last term. I didn't see it coming — I just collapsed one Sunday and couldn't stop crying.",
"P03", 8, "Coming back has been okay. They reduced my hours and I started therapy. But I don't think the underlying problem is solved at all.",
"P03", 9, "If I'm honest, I don't see myself doing this in five years. The young teachers I see are all looking for exits."
)

Click any column header to sort. Use the search box to filter. This is the actual dataset the model trains on; explore it.
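For reference, an interactive table like this one can be produced with DT, which is already loaded via p_load above. A minimal sketch, not necessarily the exact call used in this notebook:

```r
# A sketch: render the snippets as a sortable, searchable table.
DT::datatable(
  interviews,
  rownames = FALSE,
  options = list(pageLength = 9)  # all nine snippets on one page
)
```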
codebook <- tibble::tribble(
~code, ~definition,
"workload", "References to volume, intensity, or unsustainability of work tasks.",
"boundary_erosion", "Difficulty separating work from personal life or rest time.",
"emotional_toll", "Mental health impacts: stress, anxiety, burnout, crying, collapse.",
"collegial_support", "Drawing strength from peer relationships at work.",
"intrinsic_reward", "Moments of meaning, joy, or vocation in the work itself.",
"exit_intentions", "Statements about leaving, reducing, or doubting future in the role.",
"coping_strategy", "Active steps taken to manage the situation (saying no, therapy, leave)."
)

These are the “ground truth” labels a human (me) applied. The classifier will learn to mimic them.
labels <- tibble::tribble(
~snippet_id, ~code,
1, "workload",
1, "boundary_erosion",
2, "intrinsic_reward",
2, "workload",
3, "coping_strategy",
4, "boundary_erosion",
5, "collegial_support",
6, "workload",
7, "emotional_toll",
8, "coping_strategy",
8, "emotional_toll",
9, "exit_intentions",
9, "emotional_toll"
)
label_matrix <- labels |>
  mutate(applies = 1L) |>
  pivot_wider(
    id_cols = snippet_id,
    names_from = code,
    values_from = applies,
    values_fill = 0L
  ) |>
  # Fix the column order to match the codebook, so that code names
  # line up with the model's output columns later on.
  select(snippet_id, all_of(codebook$code)) |>
  arrange(snippet_id)

| name | value | what it controls |
|---|---|---|
| block_size | 32 | Max words per snippet (truncated or padded) |
| n_embd | 128 | Embedding dimensionality per word |
| n_iters | 3000 | Total gradient-descent steps |
| learning_rate | 0.2 | Plain SGD step size (no Adam; we keep the update rule simple enough to write by hand) |
| threshold | 0.5 | Probability above which the model ‘applies’ a code |
| seed_val | 42 | Random seed so the run is reproducible |
Everything that follows is in pure base R. No internet calls, no torch, no API keys. The “AI” you train below is a small neural network written in plain matrix algebra so that every operation is visible.
clean_text <- function(x) {
x |>
str_to_lower() |>
str_replace_all("[^a-z0-9' ]", " ") |>
str_squish()
}
build_vocab <- function(texts, min_freq = 1) {
tokens <- texts |> clean_text() |> str_split(" ") |> unlist()
word_freq <- tibble(word = tokens) |>
count(word, sort = TRUE) |>
filter(n >= min_freq, word != "")
tibble(
word = c("<pad>", "<unk>", word_freq$word),
id = seq_len(nrow(word_freq) + 2)
)
}
vocab <- build_vocab(interviews$text)
encode <- function(text, vocab, block_size = 32) {
words <- text |> clean_text() |> str_split(" ") |> unlist()
ids <- vocab$id[match(words, vocab$word)]
ids[is.na(ids)] <- 2L
if (length(ids) > block_size) {
ids <- ids[1:block_size]
} else {
ids <- c(ids, rep(1L, block_size - length(ids)))
}
ids
}

Type any sentence below and see how the tokenizer converts it to integer IDs. Words the model has never seen become <unk> (ID 2).
So the input the model actually sees is a sequence of integers, not words. The model only learns the meaning of those integers through their embeddings (next section).
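You can also bypass the widget and call encode() directly in the console. A quick sketch; the exact IDs depend on the vocabulary built above:

```r
# Unseen words map to 2 (<unk>); short inputs are padded with 1 (<pad>).
encode("the marking is endless and my therapist agrees", vocab, block_size = 12)
```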
We size the model so that training takes around two minutes on a student laptop (Surface Go class hardware). The whole apparatus is two matrices and a vector of biases: no torch, no automatic differentiation, no Python.
n_embd <- 128L
block_size <- 32L
learning_rate <- 0.2
n_iters <- 3000
seed_val <- 42
set.seed(seed_val)
vocab_size <- nrow(vocab)
n_codes <- nrow(codebook)
X <- t(sapply(interviews$text, encode, vocab = vocab, block_size = block_size))
storage.mode(X) <- "integer"
Y <- label_matrix |>
arrange(snippet_id) |>
select(-snippet_id) |>
as.matrix()
storage.mode(Y) <- "double"
E <- matrix(rnorm(vocab_size * n_embd, sd = 0.1),
nrow = vocab_size, ncol = n_embd)
W <- matrix(rnorm(n_embd * n_codes, sd = 0.1),
nrow = n_embd, ncol = n_codes)
b <- rep(0, n_codes)
cat(sprintf("Parameters: %s (vocab=%d x embd=%d + embd=%d x codes=%d + bias=%d)\n",
            format(vocab_size * n_embd + n_embd * n_codes + n_codes, big.mark = ","),
            vocab_size, n_embd, n_embd, n_codes, n_codes))

## Parameters: 19,079 (vocab=142 x embd=128 + embd=128 x codes=7 + bias=7)
sigmoid <- function(x) 1 / (1 + exp(-x))
forward <- function(X, E, W, b) {
N <- nrow(X)
pool <- matrix(0, nrow = N, ncol = ncol(E))
for (i in seq_len(N)) {
pool[i, ] <- colMeans(E[X[i, ], , drop = FALSE])
}
logits <- sweep(pool %*% W, 2, b, "+")
list(pool = pool, logits = logits)
}
bce_loss <- function(logits, Y) {
p <- sigmoid(logits)
eps <- 1e-10
-mean(Y * log(p + eps) + (1 - Y) * log(1 - p + eps))
}

backward <- function(X, Y, E, W, fwd) {
N <- nrow(X)
T_ <- ncol(X)
d_logits <- (sigmoid(fwd$logits) - Y) / N
d_W <- t(fwd$pool) %*% d_logits
d_b <- colSums(d_logits)
d_pool <- d_logits %*% t(W)
d_E <- matrix(0, nrow = nrow(E), ncol = ncol(E))
for (i in seq_len(N)) {
grad_per_token <- d_pool[i, ] / T_
for (j in seq_len(T_)) {
d_E[X[i, j], ] <- d_E[X[i, j], ] + grad_per_token
}
}
list(d_E = d_E, d_W = d_W, d_b = d_b)
}
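Why is the backward pass so short? For a sigmoid output $p = \sigma(z)$ trained with binary cross-entropy, the derivative of the loss with respect to each logit collapses to a famously tidy form. For a single snippet-code cell with label $y$:

$$
\frac{\partial}{\partial z}\left[-\bigl(y \log \sigma(z) + (1 - y)\log(1 - \sigma(z))\bigr)\right] = \sigma(z) - y
$$

The code divides this by N to average over snippets (bce_loss averages over all N × K cells, so the leftover 1/K constant is simply absorbed into the effective learning rate), pushes it back through the linear layer to get d_W, d_b, and d_pool, and then splits d_pool evenly across the T_ tokens because the pooling step was a plain mean.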
input_file <- knitr::current_input()
if (is.null(input_file)) {
  # fallback for interactive sessions (requires RStudio)
  input_file <- rstudioapi::getActiveDocumentContext()$path
}
cache_path <- file.path(
  # default to the notebook's own folder when no cache dir is set
  getOption("tta.cache_dir", default = dirname(input_file)),
  "tta_model_cache.rds"
)
if (file.exists(cache_path)) {
cache <- readRDS(cache_path)
E <- cache$E; W <- cache$W; b <- cache$b
loss_history <- cache$loss_history
cat("Loaded trained model from cache.\n")
} else {
loss_history <- numeric(n_iters)
t0 <- Sys.time()
for (step in seq_len(n_iters)) {
fwd <- forward(X, E, W, b)
loss <- bce_loss(fwd$logits, Y)
loss_history[step] <- loss
grads <- backward(X, Y, E, W, fwd)
E <- E - learning_rate * grads$d_E
W <- W - learning_rate * grads$d_W
b <- b - learning_rate * grads$d_b
}
elapsed <- round(as.numeric(difftime(Sys.time(), t0, units = "secs")), 1)
cat(sprintf("Training complete in %.1f seconds.\n", elapsed))
tryCatch(
saveRDS(list(E = E, W = W, b = b, loss_history = loss_history), cache_path),
error = function(e) message("Could not cache model: ", e$message)
)
}

## Loaded trained model from cache.
## Started at loss 0.6873, ended at loss 0.0043
Hover over the curve to see the exact loss at each training step. Drag to zoom into a region. Double-click to reset.
The loss falls fast at first, then plateaus close to zero. That plateau means the model has essentially memorised the nine training snippets. This is not learning generalisation. This is learning the answers to the exam. Useful for understanding the mechanism; lethal if you mistake it for actual analytic capability.
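If you want to draw the curve yourself, here is a sketch with plotly (loaded via p_load above); the chart embedded here may use different styling:

```r
# A sketch of the training-loss curve, with hover and zoom for free.
plotly::plot_ly(
  x = seq_along(loss_history),
  y = loss_history,
  type = "scatter",
  mode = "lines"
) |>
  plotly::layout(
    xaxis = list(title = "training step"),
    yaxis = list(title = "BCE loss")
  )
```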
Select a snippet from the dropdown below to see what the model predicts. The predictions are computed in your browser using the actual trained weights — nothing is pre-baked.
Read the table the way you would read any classifier’s output: probabilities close to 1 are codes the model is confident apply; close to 0 are codes it is confident don’t; anything in the middle is the model genuinely unsure. Confidence is not correctness.
Type your own text below and watch the model predict codes for it in real time. Try sentences about work stress, boundaries, colleagues, or quitting — then try something completely off-topic and watch the model flail.
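Under the hood the predictor is nothing more than the forward pass from earlier applied to your text. A minimal sketch (predict_codes is a helper name invented here, not defined elsewhere in the notebook):

```r
# Hypothetical helper: score one piece of text against every code.
predict_codes <- function(text) {
  ids   <- encode(text, vocab, block_size)
  pool  <- colMeans(E[ids, , drop = FALSE])  # mean-pool the embeddings
  probs <- sigmoid(as.numeric(pool %*% W + b))
  tibble(code = codebook$code, probability = round(probs, 3)) |>
    arrange(desc(probability))
}

predict_codes("I cried in the car park and started looking at job ads")
```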
**Deductive coding.** Codes come from a pre-existing codebook — the researcher decides the categories before looking at data. The model learns to apply those exact labels. It cannot invent new ones.
Strengths: Faster coding, replicable, works well when existing theory provides a strong framework.
Limitations: You can only find what you are looking for. Novel meaning falls through the cracks.
**Inductive coding.** Codes emerge from the data — the researcher reads closely and lets patterns surface. No pre-existing framework constrains the analysis.
Strengths: Discovery-oriented, grounded in participants’ lived experience, open to surprise.
Limitations: Slower, harder to replicate, requires deep familiarity with the data.
Why the model can’t do this: Inductive coding requires generative language ability — the capacity to name a new pattern. A tiny classifier with a fixed output layer cannot name anything; it can only say “yes” or “no” to labels it was given.
This tool lets you edit a codebook and watch keyword matches highlight in a sample passage — a transparent, rule-based alternative to the neural model above. Click any keyword cell to edit it. This mirrors how a deductive analyst might begin: start with keywords, then refine.
| Code | Keywords (comma-separated) — click to edit |
|---|---|
Matches shown with teal highlight. Edit keywords above to change what gets matched.
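The matching logic itself fits in a few lines of plain R. A sketch with illustrative starter keywords (not necessarily the ones preloaded in the widget):

```r
# Hypothetical starter keywords; edit freely.
keyword_book <- list(
  workload        = c("marking", "hours", "admin"),
  exit_intentions = c("quit", "leaving", "exits", "five years")
)

# Return every code with at least one keyword present in the cleaned text.
match_codes <- function(text, book) {
  txt  <- clean_text(text)
  hits <- vapply(book, function(kw) any(str_detect(txt, fixed(kw))), logical(1))
  names(book)[hits]
}

match_codes(interviews$text[9], keyword_book)  # expect "exit_intentions"
```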
These plots are the same family of summaries you would produce in any thematic analysis write-up. The data they sit on is the model’s actual output.
coded <- purrr::map_dfr(seq_len(nrow(interviews)), function(i) {
row <- interviews[i, ]
ids <- encode(row$text, vocab, block_size)
pool <- colMeans(E[ids, , drop = FALSE])
logits <- as.numeric(pool %*% W + b)
probs <- sigmoid(logits)
tibble(
participant_id = row$participant_id,
snippet_id = row$snippet_id,
quote = row$text,
code = codebook$code[probs >= 0.5],
probability = round(probs[probs >= 0.5], 3)
)
})

Hover for exact counts. Click a bar to isolate it.
Hover over any cell to see the participant, code, and count.
Drag nodes to rearrange. Hover for labels. Edge thickness = number of participants who share both codes.
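For the curious, the edge list behind such a graph can be shaped from coded like this (the interactive version is rendered with networkD3; this sketch only shows the data wrangling):

```r
# For each pair of codes, count how many participants received both.
participant_codes <- coded |> distinct(participant_id, code)

code_pairs <- participant_codes |>
  inner_join(participant_codes, by = "participant_id",
             relationship = "many-to-many") |>
  filter(code.x < code.y) |>  # keep each unordered pair once
  count(code.x, code.y, name = "n_participants")

g <- igraph::graph_from_data_frame(code_pairs, directed = FALSE)
```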
The table below shows the model’s agreement with the human coder across all snippet-code cells. Edit any cell to see how kappa changes — this lets you explore what “agreement” actually means.
|  | Model: Yes | Model: No |
|---|---|---|
| Human: Yes | | |
| Human: No | | |
| kappa | Landis & Koch (1977) interpretation |
|---|---|
| ≤ 0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost perfect |
A high kappa here just confirms that the model has memorised the training set — it would be alarming if it hadn’t. The interesting kappa would be one computed on held-out data, which we do not have at this scale.
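For concreteness, here is how the headline kappa can be computed from the trained weights; a sketch using psych, which is loaded via p_load above:

```r
# Flatten the 9 x 7 human and model label matrices into two parallel
# vectors of 63 snippet-code decisions, then compute Cohen's kappa.
fwd   <- forward(X, E, W, b)
model <- as.integer(sigmoid(fwd$logits) >= 0.5)
human <- as.vector(Y)
psych::cohen.kappa(cbind(human, model))$kappa
```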
For completeness, here is the same architecture written in the torch package, this time with the attention layer that we skipped in the executable version. This chunk does not run; it is here so you can see what the same idea looks like in an industrial ML library.
library(torch)
attention_head <- nn_module(
"AttentionHead",
initialize = function(head_size, n_embd, dropout) {
self$key <- nn_linear(n_embd, head_size, bias = FALSE)
self$query <- nn_linear(n_embd, head_size, bias = FALSE)
self$value <- nn_linear(n_embd, head_size, bias = FALSE)
self$dropout <- nn_dropout(dropout)
},
forward = function(x) {
C <- x$size(3)
k <- self$key(x); q <- self$query(x); v <- self$value(x)
wei <- torch_matmul(q, k$transpose(2, 3)) * (C ^ -0.5)
wei <- nnf_softmax(wei, dim = -1) |> self$dropout()
torch_matmul(wei, v)
}
)
TextClassifier <- nn_module(
"TextClassifier",
initialize = function(vocab_size, n_codes, n_embd, n_head, block_size, dropout) {
head_size <- n_embd %/% n_head
self$tok_embed <- nn_embedding(vocab_size, n_embd)
self$pos_embed <- nn_embedding(block_size, n_embd)
self$attn <- nn_module_list(
lapply(seq_len(n_head),
\(i) attention_head(head_size, n_embd, dropout))
)
self$proj <- nn_linear(n_head * head_size, n_embd)
self$ln_f <- nn_layer_norm(n_embd)
self$head <- nn_linear(n_embd, n_codes)
},
forward = function(idx) {
T_ <- idx$size(2)
pos <- torch_arange(1, T_, dtype = torch_long())$to(device = idx$device)
x <- self$tok_embed(idx) + self$pos_embed(pos)$unsqueeze(1)
x <- self$proj(torch_cat(lapply(self$attn, \(h) h(x)), dim = -1))
x <- self$ln_f(x)
self$head(x$mean(dim = 2))
}
)

Some limitations, and some things to try:

- Out-of-vocabulary words all become `<unk>` and contribute nothing. A real corpus would need word-piece tokenisation.
- To restyle the notebook, open `tta_theme.css` and change the variables in `:root`. Replace `--teal: #5EEAD4` with anything you like and re-knit.
- Edit the `init_model` chunk. Double `n_embd` to 256 for a richer embedding space. Halve `n_iters` to see what undertraining looks like.
- Delete `tta_model_cache.rds` from the project folder to force retraining on next knit.
- Set `code_folding: hide` in the YAML to default everything to collapsed.
- In the `init_model` chunk, set `n_embd = 16` and re-knit (delete the cache file first). Does the loss bottom out at the same value? What about `n_embd = 256`?
- Add a new snippet to `interviews` and a corresponding row in `labels`. Re-knit. Use the custom text predictor to test an unseen sentence. How wrong is it?