Observer Feedback as Data: Text Mining the DoDEA Americas Mid-Atlantic Learning Walkthrough Corpus

Author

Evan Scherr

Published

April 21, 2026

library(tidyverse)
library(readxl)
library(tidytext)
library(topicmodels)
library(tm)
library(glmnet)
library(textstem)
library(knitr)
library(kableExtra)
library(scales)
library(lubridate)
library(textdata)
library(reshape2)

theme_lwt <- function() {
  theme_minimal(base_size = 12) +
    theme(
      plot.title    = element_text(face = "bold", size = 13),
      plot.subtitle = element_text(color = "gray40", size = 11),
      strip.text    = element_text(face = "bold"),
      panel.grid.minor = element_blank()
    )
}

label_lookup <- c(
  "1"  = "1: Observer Process Notes",
  "2"  = "2: Early Literacy and Phonics",
  "3"  = "3: Math Reasoning and Discourse",
  "4"  = "4: Writing and Textual Evidence",
  "5"  = "5: Guided Reading and Comprehension",
  "6"  = "6: Success Criteria and Stations",
  "7"  = "7: Early Childhood Activity Learning",
  "8"  = "8: Academic Discourse and Peer Collaboration",
  "9"  = "9: Math Goals and Instructional Strategy",
  "10" = "10: Classroom Environment and Routines"
)

Introduction

Purpose and Research Questions

This report applies text mining methods to the Learning Walkthrough Tool (LWT) corpus from DoDEA Americas Mid-Atlantic for School Year 2025-2026. The LWT is a structured observation instrument used by instructional coaches and district administrators to document classroom visits and deliver written feedback to teachers. Each record includes a free-text narrative field (“Recognitions and Considerations”) alongside structured ratings across 13 Success Factor (SF) indicators. The full dataset contains 4,003 observation records spanning 27 schools; after filtering to Teacher Feedback records with non-missing narratives, 3,843 records are available for analysis.

The narrative field generates roughly four million characters of observer feedback annually that has never been subjected to systematic analysis. This project treats that text as data and applies a sequence of text mining methods to interrogate it. Three research questions guide the analysis:

Topic-SF Alignment. What latent topics emerge from LWT observer narratives, and do they align with the 13 SF indicators the instrument was designed to capture?
Observer Language Variation. Do feedback texts vary systematically by observer or school in ways that suggest construct-irrelevant language patterns rather than instructional differences?
SF Predictability from Text. Can narrative text features predict structured SF ratings, and which indicators are most and least recoverable from the text?

These questions matter beyond methodological curiosity. If observer narratives and structured ratings are not aligned, either the rating scale or the writing prompt is failing to do its job. If individual observers write in idiosyncratic ways that are statistically distinguishable from one another, the LWT is producing less comparable data across schools than administrators assume.

Target Audience

The primary audience is DoDEA Americas Mid-Atlantic district leadership, including curriculum coordinators and the community superintendent, who use LWT summary data to set instructional priorities and evaluate coaching effectiveness. A secondary audience is the educational measurement community interested in observer-generated text as a validity data source for observation instruments.

Data Sources

lwt_raw <- read_excel("MAD-LWT-Through-4-12-26.xlsx")

The dataset was exported from the DoDEA SharePoint-based LWT platform on April 12, 2026. It contains 4,003 records collected between August 27, 2025 and April 28, 2026 across 27 schools and 83 unique observers.

Key variables used in this analysis are described in the table below.

Variable	Type	Role
`Recognitions and Considerations`	Free text	Primary corpus
`SF1` through `SF13`	Categorical (Observed / Not Observed / Not Applicable)	Outcome labels for classification
`Observer`	Character	Rater identity covariate
`School2`	Character	School-level grouping
`Date_and_Time`	Datetime	Temporal stratification
`LWT Purpose`	Categorical	Analysis filter

lwt_raw |>
  count(School2, sort = TRUE) |>
  slice_head(n = 15) |>
  ggplot(aes(x = n, y = fct_reorder(School2, n))) +
  geom_col(fill = "#2166ac") +
  labs(
    title    = "Observation Records by School",
    subtitle = "Top 15 schools, SY 2025-26",
    x        = "Number of Records",
    y        = NULL
  ) +
  theme_lwt()

The distribution is reasonably balanced across the top schools, with Crossroads ES, Johnson PS, and Shughart ES each contributing approximately 200 records. No single school dominates the corpus, which limits the risk that school-level idiosyncrasies will drive corpus-wide findings.

Wrangling

Filtering and Initial Cleaning

Analysis is restricted to Teacher Feedback records with non-missing narratives. The 160 Calibration records are excluded because they are not feedback to teachers and would introduce a qualitatively different discourse type into the corpus.

lwt <- lwt_raw |>
  filter(
    `LWT Purpose` == "Teacher Feedback",
    !is.na(`Recognitions and Considerations`)
  ) |>
  mutate(
    doc_id   = row_number(),
    obs_date = as.Date(Date_and_Time),
    month    = floor_date(obs_date, "month"),
    text_raw = `Recognitions and Considerations`
  )

After filtering, 3,843 records remain for analysis.

Salutation Removal

Observer narratives frequently open with a formal salutation that is semantically uninformative and would inflate term frequencies for words like pleasure, visit, and classroom. Two regex patterns strip the opening salutation and the closing formula before tokenization.

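# Opening salutation: "Dear/Hi/Hello <name>" up to the first comma or period,
# plus whatever remains on that first line
salutation_pattern <- "^(Dear|Hi|Hello)\\s+[^,\\.\\n]+[,\\.][^\\n]*\\n?"
# Closing formula: from the first occurrence of a closing word to the end of the text
closing_pattern    <- "(Respectfully|Sincerely|Best regards|Kind regards)[,\\.\\s\\S]*$"

lwt <- lwt |>
  mutate(
    text_clean = str_remove(text_raw,  regex(salutation_pattern, ignore_case = TRUE)),
    text_clean = str_remove(text_clean, regex(closing_pattern,    ignore_case = TRUE)),
    text_clean = str_squish(text_clean)  # collapse repeated whitespace and trim
  )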
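To see what these two patterns do in practice, the sketch below applies the same cleaning steps to three made-up narratives. The example strings and the helper name strip_formulas are hypothetical and are not drawn from the LWT data; only the two patterns are copied from the cleaning step.

```r
library(stringr)

# Patterns copied from the cleaning step
salutation_pattern <- "^(Dear|Hi|Hello)\\s+[^,\\.\\n]+[,\\.][^\\n]*\\n?"
closing_pattern    <- "(Respectfully|Sincerely|Best regards|Kind regards)[,\\.\\s\\S]*$"

# Hypothetical helper wrapping the three cleaning calls
strip_formulas <- function(x) {
  x <- str_remove(x, regex(salutation_pattern, ignore_case = TRUE))
  x <- str_remove(x, regex(closing_pattern, ignore_case = TRUE))
  str_squish(x)
}

# Made-up narratives (not LWT records)
letter_style <- "Dear Colleague,\nStudents used success criteria to check their writing.\nRespectfully,\nObserver"
mid_sentence <- "Students listened respectfully to peers. Consider adding a turn-and-talk."
single_line  <- "Dear Colleague, students sorted words by vowel sound."

cleaned <- strip_formulas(c(letter_style, mid_sentence, single_line))
print(cleaned)
```

In this sketch the letter-style narrative keeps only its body. The second is cut off at the word respectfully, because the closing pattern is not anchored to the start of a line. The third comes back empty, because the salutation pattern consumes the rest of the first line. If narratives in the export are stored without line breaks, it may be worth spot-checking text_clean for unexpectedly short or empty values.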

Tokenization, Stop Word Removal, and Lemmatization

Tokenization proceeds at the word level. Standard English stop words (the stop_words table from tidytext, which bundles the SMART, Snowball, and onix lexicons) are removed along with a custom domain stop word list targeting high-frequency but low-information terms specific to instructional observation feedback. Lemmatization is applied in preference to stemming to preserve keyword interpretability in topic model outputs.

Steps excluded and rationale: (1) Stemming was excluded because over-stemming reduces interpretability and the practitioner audience needs to recognize terms in outputs. (2) Negation handling was excluded: not and similar terms sit in the standard stop word lists and are removed with them, so phrases like “not observed” do not survive as distinct tokens. (3) Short comments were not filtered because brief coaching notes may contain concentrated instructional signal.

custom_stops <- tibble(word = c(
  "teacher", "teachers", "student", "students", "classroom", "class",
  "lesson", "learning", "observed", "observation", "visit", "school",
  "ms", "mr", "mrs", "dr", "clearly", "evident", "continue", "also",
  "well", "great", "good", "nice", "wonderful", "excellent", "amazing",
  "truly", "pleasure", "visiting", "thank", "allowing", "opportunity",
  "observe", "cultivated", "strong", "culture", "mutual", "respect",
  "always", "joy", "dear", "respectfully", "sincerely", "one", "make",
  "use", "able", "will", "would", "could", "may", "must", "need",
  "see", "look", "come", "go", "get", "give", "take", "keep", "let",
  "put", "set", "way", "time", "day", "year", "work", "working",
  "used", "using", "like", "even", "still", "already", "much", "many",
  "first", "last", "new", "old", "high", "low", "large", "small",
  "long", "little", "own", "right"
))

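# Combine tidytext's built-in stop_words table with the custom domain list
all_stops <- bind_rows(stop_words, custom_stops |> mutate(lexicon = "custom"))

lwt_tokens <- lwt |>
  select(doc_id, School2, Observer, month, text_clean) |>
  unnest_tokens(word, text_clean) |>          # one lowercase word per row
  filter(!str_detect(word, "^[0-9]+$")) |>    # drop purely numeric tokens
  anti_join(all_stops, by = "word") |>        # remove standard and custom stop words
  mutate(word = lemmatize_words(word)) |>     # lemmatize the tokens that remain
  filter(nchar(word) > 2)                     # drop one- and two-character tokens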
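Because the stop word step decides which tokens ever reach the models, it can help to inspect what the standard list removes. The sketch below uses a made-up sentence rather than LWT text. It lists the lexicons bundled in tidytext's stop_words table and checks whether a negation such as not survives the anti_join.

```r
library(dplyr)
library(tidytext)

# Which lexicons are bundled in tidytext's stop_words table?
lexicons <- stop_words |> count(lexicon)
print(lexicons)

# Made-up narrative, not taken from the LWT corpus
toy <- tibble(
  doc_id = 1L,
  text   = "Students did not explain their reasoning to a partner."
)

toy_tokens <- toy |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word")

print(toy_tokens$word)

# TRUE when the negation was dropped together with the other stop words
negation_removed <- !("not" %in% toy_tokens$word)
print(negation_removed)
```

If negations need to be preserved, one option is to filter them out of the stop list before the join. A later length filter such as nchar(word) > 2 would still drop two-letter tokens like no.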

New Variables and Data Structures

Two new variables are created: doc_id (a sequential integer row identifier used to join tokens back to documents) and month (a floor-date version of the observation timestamp used for temporal aggregation). A document-term matrix is constructed after removing terms appearing in fewer than 10 documents or more than 80% of documents. A separate TF-IDF matrix is aggregated at the observer and school levels for the inter-rater variation analysis.

word_counts <- lwt_tokens |>
  count(doc_id, word, sort = TRUE)

doc_freq <- word_counts |> count(word, name = "df")
n_docs   <- n_distinct(word_counts$doc_id)

words_to_keep <- doc_freq |>
  filter(df >= 10, df <= 0.80 * n_docs) |>
  pull(word)

lwt_dtm <- word_counts |>
  filter(word %in% words_to_keep) |>
  cast_dtm(doc_id, word, n)
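The document-frequency rule is easiest to see on a tiny example. The sketch below applies the same two-sided filter to made-up counts for five documents. The lower bound is scaled down from 10 documents to 2 so that the toy data can show both ends of the filter. All names prefixed with toy_ are hypothetical.

```r
library(dplyr)

# Made-up token counts for five documents (not LWT data)
toy_counts <- tibble(
  doc_id = c(1, 1, 2, 2, 3, 3, 4, 4, 5),
  word   = c("feedback", "phonics", "feedback", "phonics",
             "feedback", "rubric", "feedback", "fraction", "feedback"),
  n      = 1L
)

# Number of documents each word appears in
toy_df     <- toy_counts |> count(word, name = "df")
toy_n_docs <- n_distinct(toy_counts$doc_id)

# Keep words that are neither too rare nor too common
toy_keep <- toy_df |>
  filter(df >= 2, df <= 0.80 * toy_n_docs) |>
  pull(word)

print(toy_keep)
```

Here feedback appears in all five documents and is dropped as too common, rubric and fraction appear once each and are dropped as too rare, and only phonics is kept.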

observer_tfidf <- lwt_tokens |>
  count(Observer, word, sort = TRUE) |>
  bind_tf_idf(word, Observer, n)

school_tfidf <- lwt_tokens |>
  count(School2, word, sort = TRUE) |>
  bind_tf_idf(word, School2, n)

top_schools <- lwt |>
  count(School2, sort = TRUE) |>
  slice_head(n = 8) |>
  pull(School2)

Analysis

Section 1: Observer and School Language Variation (TF-IDF)

TF-IDF identifies terms distinctively associated with a specific observer or school relative to the full corpus. If observers use vocabulary in ways that reflect idiosyncratic writing styles rather than genuine instructional differences, that constitutes construct-irrelevant variance in the LWT record. The six highest-volume observers are examined.
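As a small illustration of how these scores behave, the sketch below runs bind_tf_idf on made-up counts for two hypothetical observers, A and B. The words and counts are invented for the example and are not LWT results.

```r
library(dplyr)
library(tidytext)

# Made-up word counts for two hypothetical observers
toy_obs <- tibble(
  Observer = c("A", "A", "B", "B"),
  word     = c("discourse", "phonics", "discourse", "booking"),
  n        = c(3, 1, 2, 2)
)

# tf  = share of an observer's tokens accounted for by the word
# idf = natural log of (number of observers / observers using the word)
toy_tfidf <- toy_obs |> bind_tf_idf(word, Observer, n)
print(toy_tfidf)
```

A word used by every observer (discourse here) gets an idf of zero, so its TF-IDF is zero no matter how often it appears. Words confined to one observer (phonics, booking) are the ones that surface as distinctive.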

top_observers <- lwt |>
  count(Observer, sort = TRUE) |>
  slice_head(n = 6) |>
  pull(Observer)

observer_codes <- lwt |>
  count(Observer, sort = TRUE) |>
  slice_head(n = 6) |>
  mutate(label = paste0("Observer ", LETTERS[row_number()])) |>
  select(Observer, label)

observer_tfidf |>
  filter(Observer %in% top_observers) |>
  left_join(observer_codes, by = "Observer") |>
  group_by(label) |>
  slice_max(tf_idf, n = 8) |>
  ungroup() |>
  ggplot(aes(x = tf_idf, y = reorder_within(word, tf_idf, label))) +
  geom_col(fill = "#4393c3") +
  scale_y_reordered() +
  facet_wrap(~label, scales = "free_y", ncol = 2) +
  labs(
    title    = "Distinctive Vocabulary by Observer (TF-IDF)",
    subtitle = "Top 8 terms per observer; six highest-volume observers shown",
    x        = "TF-IDF",
    y        = NULL
  ) +
  theme_lwt()

The observer-level TF-IDF reveals meaningful differentiation in writing style and instructional focus. Four patterns emerge. First, one observer’s distinctive terms reflect a formal evaluative register and consistent attention to academic discourse (terms like commended, discourse, refinement, conducive, and establishing), which maps directly onto the SF framework. Second, another observer’s terms cluster heavily around early literacy content (phonics, benchmark, sound, criteria), suggesting a primary assignment to elementary literacy walkthroughs. Third, another observer’s vocabulary centers on communication process language (importance, communicating, reminding), which is substantively interpretable but reflects a different instructional lens than that of discourse-focused peers. Fourth, and most notably, one observer’s distinctive terms include walkthrough, principal, the school name, booking, and link, which are administrative and self-referential rather than instructional. These terms describe the observation process itself rather than classroom practice, representing a qualitatively different genre of writing. This is a reliability concern: if one observer’s records are primarily procedural, their SF ratings may not be generated through the same inferential process as those of other observers.

school_tfidf |>
  filter(School2 %in% top_schools) |>
  group_by(School2) |>
  slice_max(tf_idf, n = 6) |>
  ungroup() |>
  ggplot(aes(x = tf_idf, y = reorder_within(word, tf_idf, School2))) +
  geom_col(fill = "#2166ac") +
  scale_y_reordered() +
  facet_wrap(~School2, scales = "free_y", ncol = 2) +
  labs(
    title    = "Distinctive Vocabulary by School (TF-IDF)",
    subtitle = "Top 6 terms per school; eight highest-volume schools shown",
    x        = "TF-IDF",
    y        = NULL
  ) +
  theme_lwt()

School-level TF-IDF is largely interpretable as a function of grade band and content area. Schools with K-2 concentration show distinctive phonics and early literacy vocabulary. Schools serving upper elementary grades show more writing, evidence, and reasoning vocabulary. This is the expected pattern and provides evidence that observer narratives are tracking genuine instructional differences across schools rather than purely reflecting observer preference.

Section 2: LDA Topic Modeling

LDA topic modeling is applied to identify latent themes in the LWT narrative corpus. The number of topics is selected by evaluating model perplexity on a held-out 20% of documents across k values of 5, 8, 10, 12, and 15.

set.seed(42)

n_train   <- floor(0.8 * nrow(lwt_dtm))
train_idx <- sample(seq_len(nrow(lwt_dtm)), n_train)
dtm_train <- lwt_dtm[train_idx, ]
dtm_test  <- lwt_dtm[-train_idx, ]

k_values     <- c(5, 8, 10, 12, 15)
perplexities <- map_dbl(k_values, function(k) {
  model <- LDA(dtm_train, k = k, control = list(seed = 42))
  perplexity(model, newdata = dtm_test)
})

print(
  tibble(k = k_values, perplexity = perplexities) |>
    ggplot(aes(x = k, y = perplexity)) +
    geom_line(color = "#2166ac", linewidth = 1) +
    geom_point(size = 3, color = "#2166ac") +
    labs(
      title    = "LDA Perplexity by Number of Topics",
      subtitle = "Lower perplexity indicates better fit on held-out documents",
      x        = "Number of Topics (k)",
      y        = "Perplexity"
    ) +
    theme_lwt()
)

Perplexity decreases substantially from k=5 to k=10 and then flattens, suggesting that 10 topics capture the primary thematic structure of the corpus without overfitting. The model is fit at k=10.

lda_model <- LDA(lwt_dtm, k = 10, control = list(seed = 42))

lda_topics <- tidy(lda_model, matrix = "beta")

print(
  lda_topics |>
    filter(topic %in% 1:5) |>
    group_by(topic) |>
    slice_max(beta, n = 10) |>
    ungroup() |>
    mutate(topic_label = label_lookup[as.character(topic)]) |>
    ggplot(aes(x = beta, y = reorder_within(term, beta, topic_label))) +
    geom_col(fill = "#4393c3") +
    scale_y_reordered() +
    facet_wrap(~topic_label, scales = "free_y", ncol = 2) +
    labs(
      title    = "Top Terms per LDA Topic (Topics 1-5)",
      subtitle = "k = 10; term-topic probability (beta)",
      x        = "Beta",
      y        = NULL
    ) +
    theme_lwt()
)

print(
  lda_topics |>
    filter(topic %in% 6:10) |>
    group_by(topic) |>
    slice_max(beta, n = 10) |>
    ungroup() |>
    mutate(topic_label = label_lookup[as.character(topic)]) |>
    ggplot(aes(x = beta, y = reorder_within(term, beta, topic_label))) +
    geom_col(fill = "#4393c3") +
    scale_y_reordered() +
    facet_wrap(~topic_label, scales = "free_y", ncol = 2) +
    labs(
      title    = "Top Terms per LDA Topic (Topics 6-10)",
      subtitle = "k = 10; term-topic probability (beta)",
      x        = "Beta",
      y        = NULL
    ) +
    theme_lwt()
)

topic_labels <- tibble(
  topic = 1:10,
  label = c(
    "Observer Process Notes",
    "Early Literacy and Phonics",
    "Math Reasoning and Discourse",
    "Writing and Textual Evidence",
    "Guided Reading and Comprehension",
    "Success Criteria and Stations",
    "Early Childhood Activity Learning",
    "Academic Discourse and Peer Collaboration",
    "Math Goals and Instructional Strategy",
    "Classroom Environment and Routines"
  )
)

doc_topics <- tidy(lda_model, matrix = "gamma") |>
  rename(doc_id = document) |>
  mutate(doc_id = as.integer(doc_id)) |>
  group_by(doc_id) |>
  slice_max(gamma, n = 1) |>
  ungroup() |>
  left_join(topic_labels, by = "topic")
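The dominant-topic assignment reduces each document's topic mixture to a single label. The sketch below shows that step on made-up gamma values, shaped like the output of tidy(lda_model, matrix = "gamma"). It also shows one detail that may matter: slice_max keeps tied rows by default, so a document with two equally probable topics would be counted twice unless with_ties = FALSE is set. The toy values are hypothetical.

```r
library(dplyr)

# Made-up document-topic probabilities; document "2" has an exact tie
toy_gamma <- tibble(
  document = c("1", "1", "2", "2"),
  topic    = c(1L, 2L, 1L, 2L),
  gamma    = c(0.7, 0.3, 0.5, 0.5)
)

# Default behaviour: tied rows are all kept
with_ties_default <- toy_gamma |>
  group_by(document) |>
  slice_max(gamma, n = 1) |>
  ungroup()

# Exactly one row per document, the first row winning any tie
one_per_doc <- toy_gamma |>
  group_by(document) |>
  slice_max(gamma, n = 1, with_ties = FALSE) |>
  ungroup()

print(nrow(with_ties_default))
print(nrow(one_per_doc))
```

Exact ties are unlikely with real gamma values, but they can occur for very short documents. Comparing the row count of the result against the number of distinct documents is a cheap check.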

doc_topics |>
  count(topic, label) |>
  arrange(topic) |>
  kable(
    col.names = c("Topic", "Label", "Documents"),
    caption   = "LDA Topic Labels and Document Counts"
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

LDA Topic Labels and Document Counts
Topic	Label	Documents
1	Observer Process Notes	428
2	Early Literacy and Phonics	503
3	Math Reasoning and Discourse	362
4	Writing and Textual Evidence	323
5	Guided Reading and Comprehension	320
6	Success Criteria and Stations	467
7	Early Childhood Activity Learning	362
8	Academic Discourse and Peer Collaboration	361
9	Math Goals and Instructional Strategy	334
10	Classroom Environment and Routines	376

Eight of the 10 topics map directly onto SF indicator domains. Topics 3, 8, and 9 correspond to academic discourse and mathematical reasoning (SF6, SF8, SF9). Topics 2, 4, and 5 cover content-specific literacy and writing instruction (SF5, SF7). Topic 10 covers classroom environment and routines (SF4). Topic 6 covers success criteria and differentiated station work (SF5, SF6).

Two topics do not have a clear SF counterpart. Topic 1 (“Observer Process Notes”) is characterized by terms like indicator, digital, goals, data, and note that describe the observation process itself rather than classroom instruction. This suggests that a subset of narratives are written partly as administrative record-keeping rather than instructional feedback, an instrument fidelity concern. Topic 7 (“Early Childhood Activity Learning”) captures language specific to play-based and choice-driven early childhood classrooms (morning, choice, play, activity) that the SF framework does not explicitly address with a dedicated indicator. This is a coverage gap in the instrument rather than a flaw in observer writing.

print(
  doc_topics |>
    left_join(lwt |> select(doc_id, School2), by = "doc_id") |>
    filter(School2 %in% top_schools) |>
    count(School2, label) |>
    group_by(School2) |>
    mutate(pct = n / sum(n)) |>
    ungroup() |>
    ggplot(aes(x = pct, y = School2, fill = label)) +
    geom_col(position = "stack") +
    scale_x_continuous(labels = percent_format()) +
    scale_fill_brewer(palette = "Paired") +
    labs(
      title    = "Topic Distribution by School",
      subtitle = "Proportion of observations assigned to each topic",
      x        = "Proportion",
      y        = NULL,
      fill     = "Topic"
    ) +
    theme_lwt() +
    theme(legend.position = "bottom", legend.text = element_text(size = 7))
)

Topic distributions differ across schools in ways that reflect known structural features of the district. Schools with larger K-2 populations show higher concentrations of Topics 2 and 5 (early literacy, guided reading). Schools with upper elementary focus show higher concentrations of Topics 3, 8, and 9 (reasoning, discourse, math strategy). This school-level topic variation is substantively interpretable and supports the inference that narratives are capturing genuine instructional differences across schools rather than being dominated by observer style alone.

Section 3: Sentiment Trajectory over Time

Lexicon-based sentiment analysis using the AFINN lexicon examines whether observer feedback tone varies across the school year. In instructional observation feedback, sentiment valence is best interpreted as evaluative tone: positive AFINN scores reflect praise and commendation language; negative scores reflect language associated with growth areas or concerns. AFINN values range from -5 to +5 per word, so a mean above zero across the matched tokens reflects a higher density of positively scored terms relative to negative ones.
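How the scores are aggregated determines the scale of the result. The sketch below contrasts two common choices on made-up data: a mean over all matched tokens, and a sum within each document that is then averaged. The lexicon is a stand-in with invented scores on the AFINN scale, not the actual AFINN values, and the tokens are not from the LWT corpus.

```r
library(dplyr)

# Stand-in lexicon with made-up scores on the -5 to +5 scale
# (these are NOT the real AFINN values)
toy_lexicon <- tibble(
  word  = c("commend", "engage", "concern"),
  value = c(2, 1, -2)
)

# Made-up tokens for two documents; "routine" has no lexicon entry
toy_tokens <- tibble(
  doc_id = c(1, 1, 1, 2, 2),
  word   = c("commend", "engage", "routine", "concern", "engage")
)

scored <- inner_join(toy_tokens, toy_lexicon, by = "word")

# Mean over matched tokens: always stays within the lexicon's range
token_mean <- mean(scored$value)

# Sum within each document, then averaged: not bounded by that range
doc_sums <- scored |>
  group_by(doc_id) |>
  summarise(total = sum(value), .groups = "drop")
doc_sum_mean <- mean(doc_sums$total)

print(token_mean)
print(doc_sum_mean)
```

A token-level mean can never fall outside -5 to +5, whereas per-document sums grow with narrative length. When reading reported values, it helps to know which of the two aggregations produced them.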

afinn <- get_sentiments("afinn")

sentiment_monthly <- lwt_tokens |>
  inner_join(afinn, by = "word") |>
  group_by(month) |>
  summarise(
    mean_sentiment = mean(value),
    n_tokens       = n(),
    .groups = "drop"
  )

print(
  sentiment_monthly |>
    ggplot(aes(x = month, y = mean_sentiment)) +
    geom_line(color = "#2166ac", linewidth = 1) +
    geom_point(aes(size = n_tokens), color = "#2166ac", alpha = 0.7) +
    geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
    scale_size_continuous(range = c(2, 6), labels = comma_format()) +
    labs(
      title    = "Monthly Mean Sentiment in LWT Narratives",
      subtitle = "AFINN lexicon; point size proportional to token volume",
      x        = NULL,
      y        = "Mean AFINN Sentiment",
      size     = "Tokens"
    ) +
    theme_lwt()
)

Sentiment is uniformly positive across all months, ranging from approximately 6.2 in December to 7.1 in January. This is consistent with the LWT’s purpose as a coaching and feedback instrument: observers are writing to encourage and recognize teachers, so a positive valence baseline is expected and appropriate. The January peak may reflect a mid-year re-energizing dynamic following winter break. The pattern is relatively stable with no sustained decline that would suggest systemic coaching concern.

print(
  lwt_tokens |>
    inner_join(afinn, by = "word") |>
    filter(School2 %in% top_schools) |>
    group_by(School2) |>
    summarise(mean_sentiment = mean(value), .groups = "drop") |>
    arrange(desc(mean_sentiment)) |>
    ggplot(aes(x = mean_sentiment, y = fct_reorder(School2, mean_sentiment))) +
    geom_col(fill = "#4393c3") +
    geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
    labs(
      title    = "Mean Narrative Sentiment by School",
      subtitle = "AFINN lexicon; eight highest-volume schools",
      x        = "Mean AFINN Sentiment",
      y        = NULL
    ) +
    theme_lwt()
)

Across schools, mean sentiment ranges from approximately 4.7 to 8.2, a spread that is notable. Two interpretations are plausible: narratives at the higher-sentiment schools are genuinely more commendatory because instructional practice there is stronger, or observer assignment patterns mean that more encouragement-oriented observers are more frequently assigned to certain schools. Without a crossed observation design, these explanations cannot be disentangled. The school-level sentiment variation is a finding worth investigating further rather than a conclusion.

Section 4: Predicting SF Ratings from Narrative Text (LASSO)

This analysis functions as an empirical validity check: if an SF indicator is faithfully operationalized in observer writing, the rating for that indicator should be recoverable from the text. SF indicators with high predictability suggest observers are writing about what they are rating. Indicators with poor predictability suggest the rating and the narrative are being generated somewhat independently, a reliability and validity concern.

Analysis is restricted to six indicators (SF3 and SF5 through SF9), which have sufficient non-NA records and reasonable Observed/Not Observed balance for binary classification. A LASSO logistic regression model is fit for each SF indicator using TF-IDF features from the narrative. Performance is evaluated using 5-fold cross-validated AUC.
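The mechanics of this step can be sketched on simulated data. In the example below, a random feature matrix stands in for the TF-IDF matrix and a simulated binary outcome stands in for one SF rating; none of it is LWT data. The sizes and coefficients are arbitrary assumptions chosen so that two features carry signal.

```r
library(glmnet)

set.seed(42)

# Simulated stand-ins: 300 "documents", 40 "terms", two of which drive the outcome
n <- 300
p <- 40
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rbinom(n, size = 1, prob = plogis(1.5 * X[, 1] - 1.5 * X[, 2]))

# alpha = 1 selects the LASSO penalty; AUC is averaged over 5 folds per lambda
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1,
                    type.measure = "auc", nfolds = 5)

# Best mean cross-validated AUC along the lambda path
cv_auc <- max(cv_fit$cvm)

# Features with non-zero coefficients at the selected lambda (intercept excluded)
coefs      <- as.matrix(coef(cv_fit, s = "lambda.min"))
n_selected <- sum(coefs[-1, 1] != 0)

print(round(cv_auc, 3))
print(n_selected)
```

Because the reported AUC is the maximum over the lambda path, it is chosen on the same folds used to evaluate it and may be slightly optimistic. A held-out set or nested cross-validation would give a stricter estimate.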

tfidf_sparse <- word_counts |>
  filter(word %in% words_to_keep) |>
  bind_tf_idf(word, doc_id, n) |>
  cast_sparse(doc_id, word, tf_idf)

sf_targets <- c("SF3", "SF5", "SF6", "SF7", "SF8", "SF9")

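# Fit a cross-validated LASSO logistic regression for one SF indicator and
# return its best cross-validated AUC, sample size, and share rated Observed
fit_sf_lasso <- function(sf_col, lwt_df, sparse_mat) {
  # Keep binary ratings only; other values (e.g., Not Applicable) are dropped
  outcome <- lwt_df |>
    select(doc_id, rating = all_of(sf_col)) |>
    filter(rating %in% c("Observed", "Not Observed")) |>
    mutate(y = as.integer(rating == "Observed"))

  # Align matrix rows (named by doc_id) with the outcome vector
  row_ids <- as.integer(rownames(sparse_mat))
  keep    <- row_ids %in% outcome$doc_id
  X       <- sparse_mat[keep, ]
  y       <- outcome$y[match(row_ids[keep], outcome$doc_id)]

  set.seed(42)
  cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1,
                      type.measure = "auc", nfolds = 5)

  # With type.measure = "auc", the largest cvm is the AUC at lambda.min
  tibble(sf = sf_col, auc_cv = max(cv_fit$cvm), n_obs = nrow(X), pct_obs = mean(y))
}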

auc_results <- map_dfr(sf_targets, fit_sf_lasso,
                       lwt_df     = lwt,
                       sparse_mat = tfidf_sparse)

print(
  auc_results |>
    ggplot(aes(x = auc_cv, y = fct_reorder(sf, auc_cv))) +
    geom_col(fill = "#2166ac") +
    geom_vline(xintercept = 0.5, linetype = "dashed", color = "gray50") +
    geom_text(aes(label = round(auc_cv, 3)), hjust = -0.1, size = 3.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(
      title    = "Cross-Validated AUC by SF Indicator",
      subtitle = "LASSO logistic regression predicting Observed vs. Not Observed from TF-IDF features",
      x        = "AUC (5-fold CV)",
      y        = "Success Factor"
    ) +
    theme_lwt()
)

sf5_coefs <- {
  outcome <- lwt |>
    select(doc_id, rating = SF5) |>
    filter(rating %in% c("Observed", "Not Observed")) |>
    mutate(y = as.integer(rating == "Observed"))

  row_ids <- as.integer(rownames(tfidf_sparse))
  keep    <- row_ids %in% outcome$doc_id
  X       <- tfidf_sparse[keep, ]
  y       <- outcome$y[match(row_ids[keep], outcome$doc_id)]

  set.seed(42)
  cv_fit   <- cv.glmnet(X, y, family = "binomial", alpha = 1,
                         type.measure = "auc", nfolds = 5)
  coef_mat <- coef(cv_fit, s = "lambda.min")

  tibble(
    term = rownames(coef_mat),
    coef = as.vector(coef_mat)
  ) |>
    filter(term != "(Intercept)", coef != 0) |>
    arrange(desc(abs(coef)))
}

print(
  sf5_coefs |>
    slice_head(n = 20) |>
    mutate(direction = if_else(coef > 0, "Predicts Observed", "Predicts Not Observed")) |>
    ggplot(aes(x = coef, y = reorder(term, coef), fill = direction)) +
    geom_col() +
    scale_fill_manual(values = c("Predicts Observed"     = "#2166ac",
                                 "Predicts Not Observed" = "#d6604d")) +
    labs(
      title    = "Top LASSO Coefficients: SF5 (Engaging Tasks and Rigor)",
      subtitle = "Terms most predictive of Observed vs. Not Observed; lambda.min from 5-fold CV",
      x        = "LASSO Coefficient",
      y        = NULL,
      fill     = NULL
    ) +
    theme_lwt() +
    theme(legend.position = "top")
)

Cross-validated AUC across the six SF indicators ranges from 0.719 (SF7, Learning Progressions) to 0.807 (SF5, Engaging Tasks and Rigor). All six indicators exceed the no-information baseline of 0.50 by a meaningful margin, meaning narrative text carries real predictive signal about the structured rating in every case. Observers are broadly writing about what they are rating.

SF5 (AUC = 0.807) is the most text-recoverable indicator. Its LASSO model shows that terms like commended, engaged, actively, engaging, and effective are strongly associated with an Observed rating, while ensure, help, might, please, and noticed predict Not Observed. This is interpretively coherent: the Observed language is commendatory and present-tense descriptive, while the Not Observed language is forward-looking and suggestive of what the observer wished to see but did not. The SF5 result provides evidence that observers are grounding this particular rating in their narrative.

SF7 (AUC = 0.719, Learning Progressions) is the least text-recoverable indicator. Its lower AUC suggests this rating is assigned with less textual grounding than others, meaning observers may be rating SF7 from general impression or schema rather than directly observed and described evidence. This is worth flagging for instrument review.

Findings

Research Question 1: Topic-SF Alignment

Eight of the 10 LDA topics align to identifiable SF indicator domains. Topics covering academic discourse (Topics 3, 8), content-specific instruction (Topics 2, 4, 5), classroom environment (Topic 10), and goals and success criteria (Topics 6, 9) all have corresponding SF indicators. Two topics do not: Topic 1 (Observer Process Notes) reflects a subset of narratives written as administrative memos rather than instructional feedback, and Topic 7 (Early Childhood Activity Learning) captures a mode of instruction not formally represented in the SF framework. The absence of a dedicated indicator for early childhood play-based learning is a coverage gap, particularly given the district’s K-2 enrollment.

Research Question 2: Observer and School Language Variation

Observer TF-IDF analysis reveals substantive variation in writing focus that is largely, but not entirely, interpretable as genuine instructional attention. Most observer vocabulary differences reflect reasonable specialization (literacy vs. discourse vs. environment). One observer’s distinctive vocabulary is primarily procedural and self-referential rather than instructional, which is a reliability concern for that observer’s records specifically. School-level vocabulary differences track grade band and content area in expected ways, providing evidence that narratives capture real instructional differences across schools.

Research Question 3: SF Predictability from Text

All six SF indicators examined are predictable from narrative text above the no-information baseline. AUC ranges from 0.719 to 0.807. SF5 (Engaging Tasks and Rigor) is the most text-aligned rating, with LASSO coefficients that are clearly interpretable as commendatory versus growth-oriented language. SF7 (Learning Progressions) is the least text-aligned, suggesting this indicator may be rated through inference rather than documented observed evidence.

Recommendations

Four actionable recommendations follow from these findings.

1. Address the observer narrative genre problem. One high-volume observer’s records are dominated by procedural and administrative language rather than instructional feedback. District leadership should review those records and work with the observer to clarify the LWT’s narrative purpose. A brief norming protocol focused on narrative content expectations would reduce this source of construct-irrelevant variance across the record set.

2. Add a Success Factor for early childhood learning environments. Topic 7 is a recurring thematic cluster with no corresponding SF indicator. Observers are writing about play-based choice, morning routines, and activity design in ways that do not map onto the current 13-indicator framework. Adding or refining a K-2-specific indicator would improve instrument coverage and give coaches a structured way to rate what they are already observing and describing.

3. Prioritize SF7 for calibration. Learning Progressions (SF7) is the least text-recoverable indicator in the classification analysis, suggesting observers may be assigning this rating from general impression rather than specific observed evidence. A calibration session with anchor examples defining what observable evidence grounds an SF7 Observed rating would improve rater consistency across schools.

4. Use topic distribution data for school-level instructional planning. The school-level topic distribution analysis shows interpretable differences in instructional focus. This analysis can be rerun at any time with an updated LWT export and shared with curriculum coordinators as a complement to structured SF summary reports. Schools with low concentration in discourse-focused topics despite serving upper elementary grades may represent priority targets for coaching support in academic language development.

Limitations and Ethics

Several limitations bound interpretation of these findings. The corpus is single-district and single-year; findings are not generalizable to other DoDEA regions, other school years, or other observation instruments. Observer identity is confounded with school assignment, meaning school-level differences in sentiment and topic distribution cannot be cleanly attributed to instructional conditions rather than observer preference without a crossed observation design.

The narrative field is a measure of what observers noticed and chose to write, not a direct measure of instruction. LDA topics and AFINN sentiment values characterize observer attention and tone, not classroom reality. Particularly in the sentiment analysis, a school-level sentiment difference may reflect the observer assigned to that school as much as the instruction occurring there.

AFINN was developed on general English text and has not been validated for instructional observation discourse. Sentiment values in this context should be interpreted as evaluative tone rather than emotional valence in the conventional sense. Lemmatization relies on the textstem package, whose default lemmatize_words() lookup uses the lexicon::hash_lemmas dictionary; domain-specific terms may be lemmatized imperfectly.

The dataset contains identifiable teacher names in the Person Observed field. All analysis was conducted with teacher identifiers replaced by coded document IDs, and no individual-level results are reported. This analysis was conducted under existing DoDEA data governance agreements covering analytical use of operational observation data.

References

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Bryk, A. S. (2015). Accelerating how we learn to improve. Educational Researcher, 44(9), 467-477.

Krumm, A., Means, B., and Bienkowski, M. (2018). Learning analytics goes to school: A collaborative approach to improving education. Routledge.

Nielsen, F. A. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on Making Sense of Microposts.

Silge, J. and Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media. https://www.tidytextmining.com

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.