knitr::opts_chunk$set(
  echo     = TRUE,
  message  = FALSE,
  cache = FALSE,
  warning  = FALSE,
  fig.width  = 7,
  fig.height = 4.5,
  fig.align  = "center"
)
#install.packages('pacman')

library(pacman)
p_load(
  tidyverse,
  tidytext,
  knitr,
  kableExtra,
  ggraph,
  igraph,
  psych,
  plotly,
  DT,
  networkD3,
  jsonlite,
  htmltools
)

An interactive teaching artefact · not a research instrument

Welcome — and please read this first

Hello. This notebook is built for you, the student, to help you see what is happening when people say things like “we used an LLM to assist with thematic coding.” It is not a tool for actually doing your thesis analysis. We will come back to that in a moment, and again, and again, because it matters.

The whole point of this notebook is to take what is usually a black box — “a machine somehow turned interview text into codes” — and break it open into pieces small enough that you can read each one and say “ah, I see what this is doing.” Every weight, every embedding lookup, every probability, written in plain R. Nothing is hidden, and nothing is calling out to a server somewhere.

This version is fully interactive. Charts respond to hover and zoom. Tables can be sorted and searched. You can type your own text and watch the model predict codes for it in real time — all running inside your browser, no server required.

This is a teaching tool. Do not use this notebook as the analytic engine for any piece of qualitative research you intend to publish, submit for marking, or rely on. The model trained here is far too small to be reliable on real data. It will memorise the toy examples instead of learning anything generalisable. That is intentional — overfitting is what makes the mechanism visible. But it also means the model’s “codes” are not trustworthy in any methodological sense. If you take only one thing from this notebook, take that.


1 How the code in this notebook works (in plain English)

Before we touch any R, here is the whole pipeline as a story. If you can follow this story, you can follow the rest of the notebook.

  1. We start with text. Some short interview snippets, plus a small “codebook” — a list of analytic labels (like workload, boundary erosion, emotional toll) that we want the machine to learn to apply.
  2. We turn text into numbers. A computer cannot learn from words directly. So we build a tiny dictionary — every distinct word in our toy corpus gets an integer ID. The sentence “the marking is exhausting” might become [12, 8, 3, 41].
  3. We look up an embedding for each number. An embedding is just a small vector of numbers (we use 128 of them) that the model is allowed to learn. At first these vectors are random; by the end of training, words with similar meanings sit closer together in this 128-dimensional space.
  4. We squash the snippet into a single vector. We take the average of the per-word embeddings — mean-pooling. That single 128-dimensional vector now represents the whole snippet.
  5. We ask: which codes apply? A final linear layer maps that 128-vector to one number per code in the codebook. We pass each through a sigmoid so the number lies between 0 and 1, and read it as “probability that this code applies.”
  6. We train. We show the model labelled examples (snippet + the codes a human applied), measure how wrong its predictions are (binary cross-entropy loss), and nudge every weight a little in the direction that would have made it less wrong. Repeat ~3,000 times.
  7. We use it. Once trained, we feed in new snippets and read the predicted codes off.
  8. We then check it against ourselves. Because nothing the model says is reliable until a human researcher says it is.

Everything below is one of those steps, with the R code beside it. Click Show on any chunk to read the implementation; click Hide to step back to the prose.
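
As a compact preview, steps 3 to 5 of the story above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the notebook's trained model: the sizes (`vocab_size = 50`, `n_embd = 4`, `n_codes = 3`) are hypothetical and far smaller than the real ones, the weights are random and untrained, and the IDs are simply the ones used in the example sentence in step 2.

```r
# Hypothetical sizes, chosen for readability (the notebook uses 128 and 7)
set.seed(1)
vocab_size <- 50L
n_embd     <- 4L
n_codes    <- 3L

ids <- c(12L, 8L, 3L, 41L)   # step 2: "the marking is exhausting" as integer IDs

E <- matrix(rnorm(vocab_size * n_embd, sd = 0.1), nrow = vocab_size)  # embedding table
W <- matrix(rnorm(n_embd * n_codes, sd = 0.1), nrow = n_embd)         # final linear layer
b <- rep(0, n_codes)                                                   # one bias per code

emb    <- E[ids, , drop = FALSE]        # step 3: one embedding row per word (4 x 4)
pooled <- colMeans(emb)                 # step 4: mean-pool into a single vector
logits <- as.numeric(pooled %*% W) + b  # step 5: one raw score per code
probs  <- 1 / (1 + exp(-logits))        # sigmoid maps each score into (0, 1)

print(round(probs, 3))
```

Because nothing has been trained yet, every probability sits very close to 0.5: the model has no opinion until step 6 starts nudging the weights.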


2 What this is not

  • It is not a replacement for reading your transcripts. Familiarisation is phase one of Braun & Clarke’s reflexive thematic analysis (RTA) for a reason — you cannot meaningfully interpret what you have not read.
  • It is not a way to make your analysis more “objective” or “scientific.” RTA is interpretive; pretending a small classifier removes the researcher from the loop is a methodological category error.
  • It is not something to point at in your methods chapter as “I used machine learning to code my data.” If you do that, your supervisor will (rightly) ask harder questions than this notebook can answer.
  • It is not a comparable tool to a real LLM-API-based pipeline (e.g. one calling Claude or GPT to do the coding). Those pipelines are designed to augment a careful manual coding process. This notebook is purely educational and never sends any data anywhere.

So why bother? Because if you understand the eight steps above, you understand both what an LLM is doing under the hood and why you cannot trust it without checking. That is a more durable piece of knowledge than any specific analytic shortcut.


3 What is “coding” in thematic analysis again?

A code is a short descriptive label attached to a chunk of qualitative data. In Braun & Clarke’s RTA (2006, 2022), coding is phase two of six:

  1. Familiarisation
  2. Generating initial codes — this is the part we are caricaturing in code
  3. Searching for themes
  4. Reviewing themes
  5. Defining and naming themes
  6. Producing the report

Coding can be inductive (codes emerge from the data) or deductive (codes are applied from a pre-existing codebook). The model below is a deductive one — we hand it a fixed codebook and it learns to apply those exact labels. Inductive coding requires generative language ability, which we do not have at this scale.

A code is not a theme. A theme is a higher-order pattern of meaning constructed by a researcher across codes. Themes are not in this notebook. Themes are your job.


4 Setup

4.1 The toy dataset

Nine short, fully synthetic interview snippets across three participants. Each snippet has been hand-labelled with one or more codes from a small codebook below.

interviews <- tibble::tribble(
  ~participant_id, ~snippet_id, ~text,
  "P01", 1, "Honestly the marking is the worst part. I get home at six and there's another three hours of essays waiting. My partner has stopped asking when I'll come to bed.",
  "P01", 2, "I love teaching though. When a student finally gets it, that's the bit that keeps me going. It's just everything around teaching that's exhausting.",
  "P01", 3, "I've started saying no to things. Committee work, optional meetings. I felt guilty at first but it's the only way I survive.",
  "P02", 4, "I think the expectation is that you're always available. Email at 9pm, parents wanting calls on weekends. There's no boundary anymore.",
  "P02", 5, "What helps is my colleagues. We have a WhatsApp group and we just vent. I don't think I'd cope without them honestly.",
  "P02", 6, "The admin load has tripled in five years. I used to teach. Now I do compliance and call it teaching.",
  "P03", 7, "I had to take stress leave last term. I didn't see it coming — I just collapsed one Sunday and couldn't stop crying.",
  "P03", 8, "Coming back has been okay. They reduced my hours and I started therapy. But I don't think the underlying problem is solved at all.",
  "P03", 9, "If I'm honest, I don't see myself doing this in five years. The young teachers I see are all looking for exits."
)

Click any column header to sort. Use the search box to filter. This is the actual (synthetic) dataset the model trains on, so explore it.

4.2 The codebook

codebook <- tibble::tribble(
  ~code,                 ~definition,
  "workload",            "References to volume, intensity, or unsustainability of work tasks.",
  "boundary_erosion",    "Difficulty separating work from personal life or rest time.",
  "emotional_toll",      "Mental health impacts: stress, anxiety, burnout, crying, collapse.",
  "collegial_support",   "Drawing strength from peer relationships at work.",
  "intrinsic_reward",    "Moments of meaning, joy, or vocation in the work itself.",
  "exit_intentions",     "Statements about leaving, reducing, or doubting future in the role.",
  "coping_strategy",     "Active steps taken to manage the situation (saying no, therapy, leave)."
)

4.3 The hand-coded labels

These are the “ground truth” labels a human (me) applied. The classifier will learn to mimic them.

labels <- tibble::tribble(
  ~snippet_id, ~code,
  1, "workload",
  1, "boundary_erosion",
  2, "intrinsic_reward",
  2, "workload",
  3, "coping_strategy",
  4, "boundary_erosion",
  5, "collegial_support",
  6, "workload",
  7, "emotional_toll",
  8, "coping_strategy",
  8, "emotional_toll",
  9, "exit_intentions",
  9, "emotional_toll"
)

label_matrix <- labels |>
  mutate(applies = 1L) |>
  pivot_wider(
    id_cols     = snippet_id,
    names_from  = code,
    values_from = applies,
    values_fill = 0L
  ) |>
  arrange(snippet_id)
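
A base R cross-check of the same long-to-wide reshaping may help make the "one column per code, 1 where the code applies" layout concrete. This is a sketch on a three-snippet excerpt of the labels, and `code_order` is an assumed column order introduced for illustration. Pinning the order explicitly matters whenever column position is later used to name a model output.

```r
# A three-snippet excerpt of the labels table, in long format
labels_long <- data.frame(
  snippet_id = c(1, 1, 2, 2, 3),
  code       = c("workload", "boundary_erosion", "intrinsic_reward",
                 "workload", "coping_strategy")
)

# Assumed column order for this illustration
code_order <- c("workload", "boundary_erosion", "intrinsic_reward", "coping_strategy")

# Cross-tabulate; the factor levels fix the column order
wide <- table(
  snippet_id = labels_long$snippet_id,
  code       = factor(labels_long$code, levels = code_order)
)

Y_check <- unclass(wide) > 0          # TRUE where a code was applied
storage.mode(Y_check) <- "integer"    # convert to a 0/1 multi-hot matrix

print(Y_check)
```

Each row is a snippet and each row sum is the number of codes a human applied to it (2, 2 and 1 here).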

5 Hyperparameters

name            value   what it controls
block_size      32      Max words per snippet (truncated or padded)
n_embd          128     Embedding dimensionality per word
n_iters         3000    Total gradient-descent steps
learning_rate   0.2     Step size for plain full-batch gradient descent (no Adam; by hand we keep it simple)
threshold       0.5     Probability at or above which the model ‘applies’ a code
seed_val        42      Random seed so the run is reproducible

Everything that follows runs locally in R. No internet calls, no torch, no API keys. The tidyverse handles text cleaning and data wrangling; the “AI” you train below is a small neural network written in plain base R matrix algebra so that every operation is visible.


6 Building the classifier, piece by piece

6.1 The tokenizer (word-level)

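# Lower-case, replace everything except letters, digits, apostrophes and
# spaces with a space, then collapse repeated whitespace
clean_text <- function(x) {
  x |>
    str_to_lower() |>
    str_replace_all("[^a-z0-9' ]", " ") |>
    str_squish()
}

# Build the word -> integer ID dictionary, most frequent words first
build_vocab <- function(texts, min_freq = 1) {
  tokens <- texts |> clean_text() |> str_split(" ") |> unlist()
  word_freq <- tibble(word = tokens) |>
    count(word, sort = TRUE) |>
    filter(n >= min_freq, word != "")

  # IDs 1 and 2 are reserved for the padding and unknown-word tokens
  tibble(
    word = c("<pad>", "<unk>", word_freq$word),
    id   = seq_len(nrow(word_freq) + 2)
  )
}

vocab <- build_vocab(interviews$text)

# Turn one text into exactly block_size integer IDs
encode <- function(text, vocab, block_size = 32) {
  words <- text |> clean_text() |> str_split(" ") |> unlist()
  ids   <- vocab$id[match(words, vocab$word)]
  ids[is.na(ids)] <- 2L                               # unseen word -> <unk>
  if (length(ids) > block_size) {
    ids <- ids[1:block_size]                          # truncate long snippets
  } else {
    ids <- c(ids, rep(1L, block_size - length(ids)))  # right-pad with <pad>
  }
  ids
}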

Interactive tokenizer demo

Type any sentence below and see how the tokenizer converts it to integer IDs. Words the model has never seen become <unk> (ID 2).

Tokenizer Demo

So the input the model actually sees is a sequence of integers, not words. The model only learns the meaning of those integers through their embeddings (next section).
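
To see the text-to-integers step on a concrete input, here is a minimal base R sketch of the same idea. It is a stand-in, not the notebook's own functions: `clean_base()`, `encode_base()` and the four-word `toy_vocab` are hypothetical names introduced for this example, and `block_size` is shortened to 8 so the padding is easy to read.

```r
# Base R stand-in for the cleaning step: lower-case, strip punctuation
# (keeping apostrophes), collapse whitespace
clean_base <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9' ]", " ", x)
  trimws(gsub("\\s+", " ", x))
}

# Hypothetical four-word vocabulary; IDs 1 and 2 are reserved, as in the notebook
toy_vocab <- data.frame(
  word = c("<pad>", "<unk>", "the", "marking", "is", "exhausting"),
  id   = 1:6
)

encode_base <- function(text, vocab, block_size = 8L) {
  words <- strsplit(clean_base(text), " ", fixed = TRUE)[[1]]
  ids   <- vocab$id[match(words, vocab$word)]
  ids[is.na(ids)] <- 2L                        # unseen word -> <unk>
  ids <- head(ids, block_size)                 # truncate if too long
  c(ids, rep(1L, block_size - length(ids)))    # right-pad with <pad>
}

encode_base("The marking is REALLY exhausting!", toy_vocab)
# 3 4 5 2 6 1 1 1
```

"really" is not in the toy vocabulary, so it becomes 2 (`<unk>`), and the three trailing 1s are padding.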


6.2 Initialising the model

We size the model to be large enough that training takes around two minutes on a student laptop (Surface Go class hardware). The whole apparatus is two matrices and a vector of biases — no torch, no automatic differentiation, no Python.

n_embd        <- 128L
block_size    <- 32L
learning_rate <- 0.2
n_iters       <- 3000
seed_val      <- 42

set.seed(seed_val)

vocab_size <- nrow(vocab)
n_codes    <- nrow(codebook)

X <- t(sapply(interviews$text, encode, vocab = vocab, block_size = block_size))
storage.mode(X) <- "integer"

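# Put the label columns in codebook order. pivot_wider() emits them in order of
# first appearance in `labels`, which differs from codebook$code, and the model's
# outputs are later named by position using codebook$code.
Y <- label_matrix |>
  arrange(snippet_id) |>
  select(all_of(codebook$code)) |>
  as.matrix()
storage.mode(Y) <- "double"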

E <- matrix(rnorm(vocab_size * n_embd, sd = 0.1),
            nrow = vocab_size, ncol = n_embd)
W <- matrix(rnorm(n_embd     * n_codes, sd = 0.1),
            nrow = n_embd,     ncol = n_codes)
b <- rep(0, n_codes)

cat(sprintf("Parameters: %s (vocab=%d x embd=%d + embd x codes=%d + bias=%d)\n",
    format(vocab_size * n_embd + n_embd * n_codes + n_codes, big.mark = ","),
    vocab_size, n_embd, n_codes, n_codes))
## Parameters: 19,079 (vocab=142 x embd=128 + embd x codes=7 + bias=7)

6.3 The forward pass

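sigmoid <- function(x) 1 / (1 + exp(-x))

forward <- function(X, E, W, b) {
  N <- nrow(X)
  pool <- matrix(0, nrow = N, ncol = ncol(E))
  for (i in seq_len(N)) {
    # Look up one embedding row per token ID and average them (mean-pooling).
    # <pad> slots are included in the average.
    pool[i, ] <- colMeans(E[X[i, ], , drop = FALSE])
  }
  # One raw score (logit) per snippet and code: pool %*% W plus the bias
  logits <- sweep(pool %*% W, 2, b, "+")
  list(pool = pool, logits = logits)
}

bce_loss <- function(logits, Y) {
  p   <- sigmoid(logits)
  eps <- 1e-10   # keeps log() finite when p reaches exactly 0 or 1
  # Binary cross-entropy, averaged over every snippet-code cell
  -mean(Y * log(p + eps) + (1 - Y) * log(1 - p + eps))
}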
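
A small worked example may make the loss less abstract. The sketch below evaluates binary cross-entropy on four hand-picked logits, once in the naive form used above and once in an algebraically equivalent logit form that needs no `eps`. The names `bce_naive()` and `bce_stable()` are introduced here for illustration and are not part of the notebook.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# Naive form, as above: eps guards against log(0)
bce_naive <- function(logits, Y, eps = 1e-10) {
  p <- sigmoid(logits)
  -mean(Y * log(p + eps) + (1 - Y) * log(1 - p + eps))
}

# Logit form: max(z, 0) - z * y + log(1 + exp(-|z|)); never takes log(0)
bce_stable <- function(logits, Y) {
  mean(pmax(logits, 0) - logits * Y + log1p(exp(-abs(logits))))
}

logits <- matrix(c(2.0, -1.0, 0.0, 3.0), nrow = 2)   # 2 snippets x 2 codes
Y      <- matrix(c(1,    0,   1,   0),   nrow = 2)   # human labels

bce_naive(logits, Y)    # about 1.0455
bce_stable(logits, Y)   # same value, up to the effect of eps

# An absurdly confident wrong answer: the logit form stays exact
bce_stable(matrix(1000), matrix(0))   # 1000
```

Most of the 1.0455 comes from one cell: a logit of 3 (probability about 0.95) on a code the human did not apply costs about 3.05 on its own. Confident mistakes are punished hardest, which is what drives the weights to change.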

6.4 The backward pass (manual gradients)

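backward <- function(X, Y, E, W, fwd) {
  N  <- nrow(X)   # number of snippets
  T_ <- ncol(X)   # token slots per snippet (block_size)

  # Gradient of the loss with respect to the logits. Note the scaling:
  # bce_loss() averages over all N x n_codes cells, whereas this divides by N
  # only, so each step is n_codes times the exact gradient of the reported
  # loss (in effect, a larger learning rate).
  d_logits <- (sigmoid(fwd$logits) - Y) / N
  d_W <- t(fwd$pool) %*% d_logits   # gradient for the linear-layer weights
  d_b <- colSums(d_logits)          # gradient for the biases
  d_pool <- d_logits %*% t(W)       # back through the linear layer

  # Mean-pooling gave each of the T_ token slots weight 1/T_, so each slot
  # receives 1/T_ of the pooled gradient; repeated tokens accumulate.
  d_E <- matrix(0, nrow = nrow(E), ncol = ncol(E))
  for (i in seq_len(N)) {
    grad_per_token <- d_pool[i, ] / T_
    for (j in seq_len(T_)) {
      d_E[X[i, j], ] <- d_E[X[i, j], ] + grad_per_token
    }
  }

  list(d_E = d_E, d_W = d_W, d_b = d_b)
}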
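
Hand-written gradients are easy to get subtly wrong, so a common sanity check is to compare them against finite differences. The sketch below does this on a tiny, hypothetical problem (3 snippets, 4 token slots, 6 vocabulary entries, 5 embedding dimensions, 2 codes) using the same algebra as `backward()`. One assumption to note: `loss_fn()` here sums the cross-entropy over codes and averages over snippets, because that is the loss whose exact gradient is `(sigmoid(logits) - Y) / N`. The helper names `pool_fn()`, `loss_fn()` and `num_grad()` are introduced for this example only.

```r
set.seed(1)
sigmoid <- function(x) 1 / (1 + exp(-x))

# Tiny hypothetical problem
X <- matrix(sample(1:6, 12, replace = TRUE), nrow = 3)   # token IDs, 3 x 4
Y <- matrix(c(1, 0, 1, 0, 1, 0), nrow = 3)               # labels, 3 x 2
E <- matrix(rnorm(6 * 5, sd = 0.5), nrow = 6)
W <- matrix(rnorm(5 * 2, sd = 0.5), nrow = 5)
b <- c(0.1, -0.1)

# Mean-pooled embeddings, one row per snippet
pool_fn <- function(E) t(apply(X, 1, function(ids) colMeans(E[ids, , drop = FALSE])))

# Cross-entropy summed over codes, averaged over snippets
loss_fn <- function(E, W) {
  p <- sigmoid(sweep(pool_fn(E) %*% W, 2, b, "+"))
  -sum(Y * log(p) + (1 - Y) * log(1 - p)) / nrow(X)
}

# Analytic gradients, same algebra as backward()
N <- nrow(X); T_ <- ncol(X)
pool     <- pool_fn(E)
d_logits <- (sigmoid(sweep(pool %*% W, 2, b, "+")) - Y) / N
d_W      <- t(pool) %*% d_logits
d_pool   <- d_logits %*% t(W)
d_E      <- matrix(0, nrow(E), ncol(E))
for (i in seq_len(N)) for (j in seq_len(T_)) {
  d_E[X[i, j], ] <- d_E[X[i, j], ] + d_pool[i, ] / T_
}

# Central finite differences: perturb one entry at a time
num_grad <- function(f, M, h = 1e-5) {
  G <- matrix(0, nrow(M), ncol(M))
  for (k in seq_along(M)) {
    Mp <- M; Mp[k] <- Mp[k] + h
    Mm <- M; Mm[k] <- Mm[k] - h
    G[k] <- (f(Mp) - f(Mm)) / (2 * h)
  }
  G
}

err_W <- max(abs(d_W - num_grad(function(M) loss_fn(E, M), W)))
err_E <- max(abs(d_E - num_grad(function(M) loss_fn(M, W), E)))
c(err_W = err_W, err_E = err_E)   # both should be tiny
```

If either error were large, the hand-derived gradient and the loss would disagree, and training would be descending on something other than the loss being reported.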

6.5 The training loop

input_file <- knitr::current_input()

if (is.null(input_file)) {
  # fallback for interactive sessions
  input_file <- rstudioapi::getActiveDocumentContext()$path
}

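# Fall back to the session temp directory when the option is unset:
# file.path(NULL, ...) returns character(0), which would make the
# file.exists() check below fail with "argument is of length zero".
cache_path <- file.path(
  getOption("tta.cache_dir", default = tempdir()),
  "tta_model_cache.rds"
)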

if (file.exists(cache_path)) {
  cache <- readRDS(cache_path)
  E <- cache$E; W <- cache$W; b <- cache$b
  loss_history <- cache$loss_history
  cat("Loaded trained model from cache.\n")
} else {
  loss_history <- numeric(n_iters)
  t0 <- Sys.time()

  for (step in seq_len(n_iters)) {
    fwd  <- forward(X, E, W, b)
    loss <- bce_loss(fwd$logits, Y)
    loss_history[step] <- loss

    grads <- backward(X, Y, E, W, fwd)

    E <- E - learning_rate * grads$d_E
    W <- W - learning_rate * grads$d_W
    b <- b - learning_rate * grads$d_b
  }

  elapsed <- round(as.numeric(difftime(Sys.time(), t0, units = "secs")), 1)
  cat(sprintf("Training complete in %.1f seconds.\n", elapsed))

  tryCatch(
    saveRDS(list(E = E, W = W, b = b, loss_history = loss_history), cache_path),
    error = function(e) message("Could not cache model: ", e$message)
  )
}
## Loaded trained model from cache.
cat(sprintf("Started at loss %.4f, ended at loss %.4f\n",
            loss_history[1], tail(loss_history, 1)))
## Started at loss 0.6873, ended at loss 0.0043

7 The loss curve

Hover over the curve to see the exact loss at each training step. Drag to zoom into a region. Double-click to reset.

The loss falls fast at first, then plateaus close to zero. That plateau means the model has essentially memorised the nine training snippets. This is not generalisation; it is learning the answers to the exam. Useful for understanding the mechanism; lethal if you mistake it for actual analytic capability.


8 Using the trained model

8.1 Live coding demo

Select a snippet from the dropdown below to see what the model predicts. The predictions are computed in your browser using the actual trained weights — nothing is pre-baked.

Select a snippet

Model predictions

Read the table the way you would read any classifier’s output: probabilities close to 1 are codes the model is confident apply; close to 0 are codes it is confident don’t; anything in the middle is the model genuinely unsure. Confidence is not correctness.
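
The reading rule above can be written down directly. In this sketch the probabilities are hypothetical values made up for illustration (not output from the trained model), and the 0.1-wide "review band" around the threshold is an assumption of this example, not something the notebook defines.

```r
codes     <- c("workload", "boundary_erosion", "emotional_toll")  # excerpt of the codebook
probs     <- c(0.97, 0.52, 0.03)                                   # hypothetical model output
threshold <- 0.5

# Codes the model 'applies'
applied <- codes[probs >= threshold]

# Codes sitting close to the decision boundary: candidates for a human look first
review_band <- 0.1   # assumed width, for illustration only
needs_review <- codes[abs(probs - threshold) < review_band]

applied        # "workload" "boundary_erosion"
needs_review   # "boundary_erosion"
```

Note that `boundary_erosion` is both applied and flagged: crossing the threshold by 0.02 is a coin-flip dressed up as a decision. The reverse does not hold, though. A probability of 0.97 tells you the model is confident, not that it is right.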


8.2 Custom text predictor

Type your own text below and watch the model predict codes for it in real time. Try sentences about work stress, boundaries, colleagues, or quitting — then try something completely off-topic and watch the model flail.

Type your own text

Model says...


9 Inductive vs deductive coding

9.1 Deductive coding (what this model does)

Codes come from a pre-existing codebook: the researcher decides the categories before looking at data. The model learns to apply those exact labels. It cannot invent new ones.

Strengths: Faster coding, replicable, works well when existing theory provides a strong framework.

Limitations: You can only find what you are looking for. Novel meaning falls through the cracks.

9.2 Inductive coding (what this model cannot do)

Codes emerge from the data — the researcher reads closely and lets patterns surface. No pre-existing framework constrains the analysis.

Strengths: Discovery-oriented, grounded in participants’ lived experience, open to surprise.

Limitations: Slower, harder to replicate, requires deep familiarity with the data.

Why the model can’t do this: Inductive coding requires generative language ability — the capacity to name a new pattern. A tiny classifier with a fixed output layer cannot name anything; it can only say “yes” or “no” to labels it was given.


10 Codebook keyword matcher

This tool lets you edit a codebook and watch keyword matches highlight in a sample passage — a transparent, rule-based alternative to the neural model above. Click any keyword cell to edit it. This mirrors how a deductive analyst might begin: start with keywords, then refine.

Editable Codebook

Code    Keywords (comma-separated; click to edit)

Highlighted passage

Matches shown with teal highlight. Edit keywords above to change what gets matched.
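
The matching logic behind a tool like this can be sketched in a few lines of base R. The keyword lists below are hypothetical (the widget's own defaults are not reproduced here), `match_codes()` is a name introduced for this example, and the sketch assumes keywords are plain words with no regular-expression special characters.

```r
# Hypothetical keyword lists, one character vector per code
keywords <- list(
  workload         = c("marking", "admin", "hours"),
  boundary_erosion = c("weekends", "email", "boundary"),
  emotional_toll   = c("stress", "crying", "collapsed")
)

passage <- "Email at 9pm, parents wanting calls on weekends. There's no boundary anymore."

match_codes <- function(text, keywords) {
  text <- tolower(text)
  hits <- lapply(keywords, function(kw) {
    # \\b marks word boundaries, so "admin" would not match inside "administrator"
    found <- vapply(
      kw,
      function(k) grepl(paste0("\\b", k, "\\b"), text, perl = TRUE),
      logical(1)
    )
    kw[found]
  })
  hits[lengths(hits) > 0]   # keep only codes with at least one matching keyword
}

match_codes(passage, keywords)
# $boundary_erosion
# [1] "weekends" "email"    "boundary"
```

Unlike the neural model, every decision here can be traced to a specific word, which is exactly why it is a useful contrast. It is also why it misses anything phrased in words the analyst did not anticipate.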


11 Statistics on the model’s predictions

These plots are the same family of summaries you would produce in any thematic analysis write-up. The data they sit on is the model’s actual output.

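coded <- purrr::map_dfr(seq_len(nrow(interviews)), function(i) {
  row    <- interviews[i, ]
  ids    <- encode(row$text, vocab, block_size)
  pool   <- colMeans(E[ids, , drop = FALSE])   # mean-pooled snippet vector
  logits <- as.numeric(pool %*% W + b)
  probs  <- sigmoid(logits)
  keep   <- probs >= 0.5                       # the `threshold` hyperparameter

  # One row per applied code; a snippet with no code at or above the threshold
  # yields zero rows. Naming outputs by position assumes the columns of Y
  # follow codebook$code.
  tibble(
    participant_id = row$participant_id,
    snippet_id     = row$snippet_id,
    quote          = row$text,
    code           = codebook$code[keep],
    probability    = round(probs[keep], 3)
  )
})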

11.1 Code frequency

Hover for exact counts. Click a bar to isolate it.

11.2 Per-participant heatmap

Hover over any cell to see the participant, code, and count.

11.3 Co-occurrence network

Drag nodes to rearrange. Hover for labels. Edge thickness = number of participants who share both codes.
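
The edge weights described above can be computed with one matrix product. This is a base R sketch on a small hypothetical set of code assignments (not the model's real output): build a participant-by-code incidence matrix, then multiply it by its own transpose so that each cell counts the participants who have both codes.

```r
# Hypothetical long-format assignments: one row per (participant, code)
assignments <- data.frame(
  participant_id = c("P01", "P01", "P01", "P02", "P02", "P03", "P03"),
  code           = c("workload", "boundary_erosion", "coping_strategy",
                     "workload", "boundary_erosion",
                     "emotional_toll", "coping_strategy")
)

# Participant x code incidence: 1 if the participant has the code at least once
incidence <- unclass(table(assignments$participant_id, assignments$code)) > 0
storage.mode(incidence) <- "integer"

# t(incidence) %*% incidence: cell [a, b] = participants with both a and b
co_occ <- crossprod(incidence)
diag(co_occ) <- 0   # a code trivially co-occurs with itself

# Upper triangle -> edge list (each pair once, zero-weight pairs dropped)
idx   <- which(upper.tri(co_occ) & co_occ > 0, arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(co_occ)[idx[, "row"]],
  to     = colnames(co_occ)[idx[, "col"]],
  weight = co_occ[idx]
)
edges
```

Here `workload` and `boundary_erosion` share two participants (P01 and P02), so that edge would be drawn thickest. An edge list in this from/to/weight shape is the usual input for a graph-drawing library.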


12 Validation — Cohen’s kappa

12.1 Interactive kappa calculator

The table below shows the model’s agreement with the human coder across all snippet-code cells. Edit any cell to see how kappa changes — this lets you explore what “agreement” actually means.

Agreement matrix (click cells to edit)

              Model: Yes   Model: No
Human: Yes
Human: No

Cohen's kappa: 1

kappa        Landis & Koch (1977) interpretation
< 0.00       Poor
0.00–0.20    Slight
0.21–0.40    Fair
0.41–0.60    Moderate
0.61–0.80    Substantial
0.81–1.00    Almost perfect

A high kappa here just confirms that the model has memorised the training set — it would be alarming if it hadn’t. The interesting kappa would be one computed on held-out data, which we do not have at this scale.
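
For readers who want to see the arithmetic behind the calculator, here is a minimal base R sketch of Cohen's kappa for a 2 x 2 agreement table. The function name `cohens_kappa()` is introduced for this example. The first table assumes perfect agreement over the 9 x 7 = 63 snippet-code cells, 13 of which carry a human label; the second is a hypothetical disagreement pattern made up for illustration.

```r
# kappa = (p_o - p_e) / (1 - p_e)
cohens_kappa <- function(tab) {
  n   <- sum(tab)
  p_o <- sum(diag(tab)) / n                       # observed agreement
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (p_o - p_e) / (1 - p_e)
}

dn <- list(human = c("yes", "no"), model = c("yes", "no"))

# Perfect agreement on all 63 cells
perfect <- matrix(c(13,  0,
                     0, 50), nrow = 2, byrow = TRUE, dimnames = dn)
cohens_kappa(perfect)     # 1

# Hypothetical: the model misses 3 human codes and adds 2 of its own
imperfect <- matrix(c(10,  3,
                       2, 48), nrow = 2, byrow = TRUE, dimnames = dn)
cohens_kappa(imperfect)   # about 0.75
```

In the second table raw agreement is 58/63, about 92%, yet kappa is only about 0.75. Most cells are "no" for both coders, so a lot of agreement would happen by chance alone, and kappa discounts it.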


13 Reference — how this looks in torch

For completeness, here is the same architecture written in the torch package — with the attention layer that we skipped in the executable version. This chunk does not run; it is here so you can see what the same idea looks like in an industrial ML library.

library(torch)

attention_head <- nn_module(
  "AttentionHead",
  initialize = function(head_size, n_embd, dropout) {
    self$key     <- nn_linear(n_embd, head_size, bias = FALSE)
    self$query   <- nn_linear(n_embd, head_size, bias = FALSE)
    self$value   <- nn_linear(n_embd, head_size, bias = FALSE)
    self$dropout <- nn_dropout(dropout)
  },
  forward = function(x) {
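    k   <- self$key(x);   q <- self$query(x);   v <- self$value(x)
    d_k <- k$size(3)   # key dimension (head_size)
    # Scaled dot-product attention divides by sqrt(d_k), not sqrt(n_embd)
    wei <- torch_matmul(q, k$transpose(2, 3)) * (d_k ^ -0.5)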
    wei <- nnf_softmax(wei, dim = -1) |> self$dropout()
    torch_matmul(wei, v)
  }
)

TextClassifier <- nn_module(
  "TextClassifier",
  initialize = function(vocab_size, n_codes, n_embd, n_head, block_size, dropout) {
    head_size      <- n_embd %/% n_head
    self$tok_embed <- nn_embedding(vocab_size, n_embd)
    self$pos_embed <- nn_embedding(block_size, n_embd)
    self$attn      <- nn_module_list(
      lapply(seq_len(n_head),
             \(i) attention_head(head_size, n_embd, dropout))
    )
    self$proj <- nn_linear(n_head * head_size, n_embd)
    self$ln_f <- nn_layer_norm(n_embd)
    self$head <- nn_linear(n_embd, n_codes)
  },
  forward = function(idx) {
    T_  <- idx$size(2)
    pos <- torch_arange(1, T_, dtype = torch_long())$to(device = idx$device)
    x   <- self$tok_embed(idx) + self$pos_embed(pos)$unsqueeze(1)
    x   <- self$proj(torch_cat(lapply(self$attn, \(h) h(x)), dim = -1))
    x   <- self$ln_f(x)
    self$head(x$mean(dim = 2))
  }
)

14 Limitations to internalise

  1. The model has memorised the training set. It does not “know” what workload means. It has learned which integer-id patterns co-occur with the workload label in nine snippets.
  2. No reflexivity. The model has no positionality, no theoretical commitments, no situated reading. RTA is not an analysis you can “run.” It is one you do.
  3. No latent meaning. Tiny classifiers handle surface-level lexical patterns. Sarcasm, irony, embodied knowledge, culturally specific meaning, code-switching — all invisible.
  4. Scale-dependent. Many of the things that look impressive in commercial LLMs are emergent — they appear only at billions of parameters. None of those will appear here.
  5. Replicability is not validity. A model can apply the wrong code consistently and reliably forever.
  6. Overfit by design. We train until the loss is essentially zero on the training set. That is the opposite of what you want for inference on real data.
  7. Tiny vocabulary. Words the model has never seen all collapse to the single <unk> token, whose embedding carries no information about them. A real corpus would need word-piece tokenisation.

15 Hints for your own RStudio work


16 Exercises


17 Further reading


Thematic Analysis Assistant · authored by Harry Stanley · built with Claude Code · 2026