Abstract
This project reports a methodological advancement in how topic
modeling can be used to analyze qualitative data from transcripts of
focus groups conducted with participants in an international
professional development program. The program is designed to prepare
inservice educators for culturally responsive teaching (CRT) through
practice developing technical representations of cultural themes in an
international context. Themes developed using topic modeling are
compared with an existing thematic analysis conducted by researchers on
the same data. The comparison provides insight into the strengths and
weaknesses of each approach and demonstrates the potential of topic
modeling to enhance qualitative data analysis. It should be useful to
future researchers interested in the interaction between qualitative
methodology and topic modeling.
1. PREPARE
1a. Data sources used for text analysis
Context: Professional Development Abroad
The data are from an ongoing, international professional development
program for inservice teachers. The program includes three Saturday
classes in the spring, a two-week study abroad experience in the summer
at an international destination (in 2022, Munich, Germany), and one
follow-up session in the fall. In this professional development abroad, teachers
build a portfolio that represents a chosen cultural theme, including a
lesson plan illustrating how they plan to apply technical cultural
representation and/or analysis in their own classrooms. The program aims
to provide teachers with experiences practicing with cultural frames and
representational tools, so they can work with their own students to
elicit and represent diverse cultural identities and perspectives.
Data Collection and Analysis
First, using qualitative methods, the research team analyzed data
collected during focus groups, from digital portfolios, and from
observations during the professional development program. Data collected
consisted of researcher memos, projects and artifacts created by 19
teachers, and post-reflective focus groups after the professional
development experience. Drawing on the focus group transcripts,
participants’ portfolios, and researcher observations, the team used
Braun and Clarke’s (2022) process of reflexive thematic analysis to
analyze the focus group data.
Initial findings from thematic analysis of focus group data indicated
that
1.) Self-selected digital projects allowed participants to develop
professional knowledge related to their specific
teaching content, curriculum, and classroom;
2.) International immersive experience with targeted professional
development provided an opportunity for reflections on
pandemic/post-pandemic instruction; and
3.) Teachers reflected that the experience had an impact on their
teaching dispositions and plans for the future.
This project outlines a second step in this process, in which the
same data are analyzed using topic modeling. Topic modeling is a
machine learning technique that automatically analyzes text data to
identify clusters of words, or topics, across a set of documents. This
is known as ‘unsupervised’ machine learning because it doesn’t require a
predefined list of tags or training data that’s been previously
classified by humans. This additional step contributes to qualitative
research by offering a means of comparing, and reflecting on, the two
sets of results.
1b. Guiding Question
With the increasing amount of qualitative data being generated in
research studies, it is essential to have effective methods for
analyzing and interpreting this information. The use of topic modeling
in the analysis of textual data has gained popularity in recent years
due to its ability to identify hidden themes and patterns within large
datasets. This project describes a potential methodological advancement
on how topic modeling can be used to analyze qualitative data.
Thus, the primary research question is:
What does topic modeling in the analysis of textual data from
focus groups with inservice teachers participating in international
professional development reveal about how topic modeling can enhance
qualitative research data?
2. WRANGLE
To analyze the textual data using topic modeling, the workflow of the
process will include data wrangling, modeling, and exploration as
follows:
The data will be wrangled, a process which involves some combination
of cleaning, reshaping, transforming, and merging data (Wickham &
Grolemund, 2017). The text will be preprocessed, including converting
the data into a tidy text format. This includes tokenization and
removal of stop words; stemming is considered but not adopted (see 2c).
The text will then be analyzed by fitting a topic model using Latent
Dirichlet Allocation (LDA). Beta values are then examined to explore
the findings for a chosen number of topics, and this process is
repeated until an appropriate number of topics is decided upon.
While topic modeling involves many decisions and can be as much art as
science (Bail, 2018), the purpose of topic modeling in this context was
to develop a simple mathematical summary of the dataset which can help
further explore trends and patterns in the data.
2a. Project Set Up and Import Interview Data
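library(tidyverse)    # data import and wrangling (attaches dplyr and ggplot2)
library(tidytext)     # tidy text mining: tokenizing, stop words, cast_dtm()
library(SnowballC)    # word stemming
library(topicmodels)  # LDA()
library(stm)          # structural topic models and textProcessor()
library(ldatuning)    # FindTopicsNumber()
library(knitr)        # report rendering
library(LDAvis)       # interactive topic explorer
library(ggplot2)      # plotting (already attached by tidyverse)
library(dplyr)        # data manipulation (already attached by tidyverse)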
After setting up the project in R, the data will be imported into
R.
CIDRE_interviews <- read_csv("data/Independent Analysis April 18 2023.csv",
col_types = cols(
Interview = col_character(),
Interviewee = col_character(),
Topic = col_character()
)
)
CIDRE_interviews <- CIDRE_interviews %>%
mutate(Interview = strsplit(as.character(Interview), "\n\n\n")) %>%
unnest(Interview)
CIDRE_interviews
## # A tibble: 333 × 3
## Interview Inter…¹ Topic
## <chr> <chr> <chr>
## 1 "Next is Chris and my picture. Let me make it bigger. Okay, th… Chris … <NA>
## 2 "making sure that she remembered in the photo that they were" Chris … <NA>
## 3 "…so this is the German theatre Museum. This one I went to on … Chris … <NA>
## 4 "Those are very grown up answers." Chris … <NA>
## 5 "We have to be grown up at some point.\n" Chris … <NA>
## 6 "\nThis is my photo of my personal experience. It was taken by… Kaitli… <NA>
## 7 "for the tape No it's a very it's a very like peaceful setting… Kaitli… <NA>
## 8 "Yeah, for the tape, it’s a swan, surrounded by its own feces … Kaitli… <NA>
## 9 "I feel badly that I already bailed on the public school syste… Karen … <NA>
## 10 "you did so much photographing. Do you have some favorite phot… Karen … <NA>
## # … with 323 more rows, and abbreviated variable name ¹Interviewee
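The split-and-unnest step above may be easier to follow on a toy example. The sketch below uses a made-up two-row tibble (the object names `toy` and `toy_long` and the text are hypothetical, not project data) and assumes, as in the call above, that turns are separated by three newlines.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical transcript: each row holds one or more turns separated by "\n\n\n"
toy <- tibble(
  Interviewee = c("Speaker A", "Speaker B"),
  Interview   = c("First turn.\n\n\nSecond turn.", "Only one turn.")
)

toy_long <- toy %>%
  # strsplit() returns a list column: one character vector of turns per row
  mutate(Interview = strsplit(as.character(Interview), "\n\n\n")) %>%
  # unnest() expands the list column so that each turn gets its own row
  unnest(Interview)

toy_long
```

Each turn becomes its own row while the other columns are repeated, which is why the imported data grows to one row per turn.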
2b. Cast a Document Term Matrix
The text was preprocessed by converting the data into a tidy text
format: the text was tokenized and stop words were removed. In addition
to the standard stop word list, the following filler words were removed:
like, just, well, yeah, lot, stuff, gonna.
interviews_tidy <- CIDRE_interviews %>%
unnest_tokens(output = word, input = Interview) %>%
anti_join(stop_words, by = "word") %>%
filter(!word %in% c("lot", "stuff", "gonna", "like", "just", "well", "yeah"))
interviews_tidy
## # A tibble: 8,136 × 3
## Interviewee Topic word
## <chr> <chr> <chr>
## 1 Chris Alston <NA> chris
## 2 Chris Alston <NA> picture
## 3 Chris Alston <NA> bigger
## 4 Chris Alston <NA> picture
## 5 Chris Alston <NA> image
## 6 Chris Alston <NA> modern
## 7 Chris Alston <NA> art
## 8 Chris Alston <NA> museum
## 9 Chris Alston <NA> floor
## 10 Chris Alston <NA> chairs
## # … with 8,126 more rows
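As a possible alternative to chaining conditions inside filter(), project-specific stop words can be kept in a small lookup table and removed with a second anti_join(). The sketch below is a minimal illustration on a made-up utterance; the names `toy`, `custom_stop_words`, and `toy_tidy` are hypothetical and do not appear in the project code.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical utterance, not taken from the focus group data
toy <- tibble(
  Interviewee = "Speaker A",
  Interview   = "Yeah, I just feel like the museum was, well, really cool stuff"
)

# Project-specific filler words collected in one lookup table
custom_stop_words <- tibble(
  word = c("like", "just", "well", "yeah", "lot", "stuff", "gonna")
)

toy_tidy <- toy %>%
  unnest_tokens(output = word, input = Interview) %>%  # one lowercase token per row
  anti_join(stop_words, by = "word") %>%               # standard English stop words
  anti_join(custom_stop_words, by = "word")            # project-specific additions

toy_tidy$word
```

Keeping the additions in a table makes the list easy to extend and to report alongside the results.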
A word count was conducted to determine the most common words in the
interviews. These word counts were later used to create a document
term matrix for topic modeling.
interviews_tidy %>%
count(word, sort = TRUE)
## # A tibble: 2,759 × 2
## word n
## <chr> <int>
## 1 people 117
## 2 feel 68
## 3 time 68
## 4 experience 65
## 5 cool 45
## 6 world 45
## 7 kids 43
## 8 germany 42
## 9 picture 41
## 10 students 39
## # … with 2,749 more rows
The terms “people,” “feel,” “time,” and “experience” emerged as the
most common, reflecting that the study topic was an international
learning experience. The terms “community,” “connections,” and
“professional” are worth attention as well. The terms “white,” with 14
instances, and “black,” with 13 instances, suggest a potential theme of
race to explore.
Creating a Document Term Matrix
Each interview group was treated as a unique document, for a total
of ten documents and 2,759 terms. Using the existing word counts, a
matrix was created that contained a row for each document, a column for
each word in the corpus, and a value of n for how many times that word
occurs in each document.
To create this document term matrix from the interview counts, the
cast_dtm() function was used and assigned to the variable
interviews_dtm.
interviews_dtm <- interviews_tidy %>%
count(Topic, word) %>%
cast_dtm(Topic, word, n)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
## <<DocumentTermMatrix (documents: 10, terms: 2759)>>
## Non-/sparse entries: 3128/24462
## Sparsity : 89%
## Maximal term length: 17
## Weighting : term frequency (tf)
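The sparsity figure in the summary above can be checked by hand: it is the share of document-term cells whose count is zero. The short calculation below simply re-derives it from the printed numbers; the variable names are illustrative.

```r
# Values copied from the printed DocumentTermMatrix summary
n_docs     <- 10
n_terms    <- 2759
non_sparse <- 3128    # cells with a count greater than zero
sparse     <- 24462   # cells with a count of zero

total_cells <- n_docs * n_terms      # 27590 document-term cells in total
sparsity    <- sparse / total_cells  # share of cells that are empty

round(100 * sparsity)                # 89, matching "Sparsity : 89%"
```

A sparsity of 89% means that a typical term appears in only one or two of the ten documents.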
2c. Preprocessing and (not) Stemming
Next, the original data set was prepared for structural topic modeling
using the textProcessor() function, which lowercases the text and
removes punctuation, stop words, and numbers to simplify the results.
temp <- textProcessor(CIDRE_interviews$Interview,
metadata = CIDRE_interviews,
lowercase=TRUE,
removestopwords=TRUE,
removenumbers=TRUE,
removepunctuation=TRUE,
wordLengths=c(3,Inf),
stem=TRUE,
onlycharacter= FALSE,
striphtml=TRUE,
customstopwords=NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
Stemming was considered as a preprocessing step to reduce the size of
the vocabulary and thus simplify the model. Stemming reduced the number
of terms from 2920 to 2241. This reduction of 679 terms in the
vocabulary does not justify the risk of losing subtleties in meaning
between words with similar stems, such as “support” and “supported,”
that would be relevant to the findings of the study. Therefore, this
project does not use stemming. Note that the textProcessor() call shown
above has stem=TRUE; setting stem=FALSE keeps the unstemmed vocabulary.
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
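To illustrate what stemming does to related word forms, the sketch below applies the Porter stemmer from the SnowballC package to a hypothetical word list (the list is chosen for illustration and is not drawn from the transcripts).

```r
library(SnowballC)

# Hypothetical word list; wordStem() uses the Porter stemmer by default
words <- c("support", "supported", "supporting", "supports")
stems <- wordStem(words)

# Side-by-side view of each word and its stem
data.frame(word = words, stem = stems)

# Number of distinct vocabulary entries left after stemming
length(unique(stems))
```

All four forms collapse to the same stem, so a model fit on stemmed text can no longer distinguish, for example, support that was received from support that is ongoing.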
3. MODEL
The text was modeled to create a mathematical summary of the dataset.
The resulting summaries can help trends and patterns in the data to
surface, particularly when compared to thematic analysis. Topic modeling
is an unsupervised learning approach that can provide insight into the
structure of the dataset.
Three steps were undertaken:
- Fitting a Topic Model with LDA. The
topicmodels package and associated LDA()
function for unsupervised classification of the interview data were used
to find natural groupings of words, or topics.
- Fitting a Structural Topic Model. The
stm package and stm() function were used to
fit a model that can use metadata about documents to improve the
assignment of words to “topics” in the corpus.
- Choosing K. Finally, an appropriate number of
topics was selected.
3a. Fitting a Topic Model with LDA
Latent Dirichlet Allocation, fit here with the LDA() function, was used
because it treats every document as a mixture of topics and every topic
as a mixture of words. This means that a focus group interview could
have an estimated topic proportion of 80% for Topic 1 but also be partly
about Topic 2. Likewise, words can be shared between topics: terms
germane to the study, such as “community” and “language,” might appear
in more than one topic.
LDA requires a value of k to be specified for the number of topics in
the focus group interviews. A value of k = 20 was selected as a starting
point, and the model was then rerun with k = 15 for comparison.
interviews_lda <- LDA(interviews_dtm,
k = 20,
control = list(seed = 588)
)
interviews_lda
## A LDA_VEM topic model with 20 topics.
ap_topics <- tidy(interviews_lda, matrix = "beta")
ap_top_terms <- ap_topics %>%
group_by(topic) %>%
slice_max(beta, n = 6) %>%
ungroup() %>%
arrange(topic, -beta)
ap_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
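LDA does not use the Term Frequency-Inverse Document Frequency (TF-IDF)
metric. The model is fit to the raw term counts in the document term
matrix (weighted by term frequency, as shown in its summary) and
estimates two sets of probabilities: beta, the probability of each term
within a topic, and gamma, the proportion of each document attributed to
each topic.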
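The idea that each document is a mixture of topics can be illustrated on a toy corpus. The sketch below is a minimal example under stated assumptions: the word counts, document names, and the choice of k = 2 are all hypothetical, and only the per-document topic proportions (gamma) are inspected.

```r
library(dplyr)
library(tibble)
library(tidytext)
library(topicmodels)

# Hypothetical word counts for three toy "documents" (not project data)
toy_counts <- tribble(
  ~document, ~word,       ~n,
  "group_a", "museum",     5,
  "group_a", "art",        4,
  "group_a", "picture",    3,
  "group_b", "students",   5,
  "group_b", "lesson",     4,
  "group_b", "classroom",  3,
  "group_c", "museum",     2,
  "group_c", "students",   2,
  "group_c", "picture",    1,
  "group_c", "lesson",     1
)

toy_dtm <- cast_dtm(toy_counts, document, word, n)
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 588))

# gamma: estimated share of each document attributed to each topic
toy_gamma <- tidy(toy_lda, matrix = "gamma")
toy_gamma

# Within every document the topic shares sum to 1
toy_gamma %>%
  group_by(document) %>%
  summarise(total = sum(gamma))
```

The same tidy(..., matrix = "gamma") call could be applied to interviews_lda to see how strongly each focus group document loads on each of the 20 topics.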
3b. Fitting a Structural Topic Model
Bail (2018) argues that one reason STM has risen in popularity and
use is that it employs metadata about documents to improve the
assignment of words to topics in a corpus, and that it can be used to
examine relationships between covariates and documents. This was useful
for this project in confirming that topics related to themes revealed in
the qualitative data.
The stm Package
Before fitting an STM model, it was necessary to extract the
following elements:
docs <- temp$documents
meta <- temp$meta
vocab <- temp$vocab
These elements were then used to fit the model using the same number
of topics (K = 20) that was specified for the LDA topic model.
interviews_stm <- stm(documents=docs,
data=meta,
vocab=vocab,
K=20,
max.em.its=25,
verbose = FALSE)
interviews_stm
## A topic model with 20 topics, 308 documents and a 2329 word dictionary.
The plot() method for a fitted STM displays the most probable words
assigned to each topic.
plot.STM(interviews_stm, n = 5)

plot(interviews_stm, n = 5)
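The stm() call in this section passes the metadata through the data argument but does not specify a formula, so the covariates do not enter the model. The sketch below shows one way metadata can be used, through the prevalence argument. It is an illustration only: it relies on the gadarian example data shipped with the stm package rather than the focus group data, and the choices of K = 3 and the treatment covariate are assumptions made for the example.

```r
library(stm)

# gadarian is an example survey dataset shipped with the stm package
data(gadarian)

processed <- textProcessor(gadarian$open.ended.response,
                           metadata = gadarian,
                           verbose = FALSE)

# prepDocuments() keeps documents, vocabulary, and metadata aligned
out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     verbose = FALSE)

# prevalence = ~ treatment lets topic proportions vary with a document covariate
toy_stm <- stm(documents = out$documents,
               vocab = out$vocab,
               K = 3,
               prevalence = ~ treatment,
               data = out$meta,
               max.em.its = 10,
               seed = 588,
               verbose = FALSE)

# Most probable and most distinctive words for each topic
labelTopics(toy_stm, n = 5)
```

For the focus group data, a comparable call might use a covariate such as Interviewee, provided the metadata rows stay aligned with the processed documents.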

3c. Finding K
As alluded to earlier, selecting the number of topics for a model
is a non-trivial decision and can dramatically affect the results. Bail
(2018) notes that
The results of topic models should not be over-interpreted unless
the researcher has strong theoretical apriori about the number of topics
in a given corpus, or if the researcher has carefully validated the
results of a topic model using both the quantitative and qualitative
techniques described above.
The FindTopicsNumber Function
The ldatuning package was used to assist with finding a value for
K.
k_metrics <- FindTopicsNumber(
interviews_dtm,
topics = seq(10, 75, by = 5),
metrics = "Griffiths2004",
method = "Gibbs",
control = list(),
mc.cores = NA,
return_models = FALSE,
verbose = FALSE,
libpath = NULL
)
FindTopicsNumber_plot(k_metrics)
Note that the FindTopicsNumber() function supports three
additional metrics (CaoJuan2009, Arun2010, and Deveaud2014) that can be
used to estimate the most preferable number of topics for an LDA model.
The Griffiths2004 metric, included in the default example, was used here
because it produced the most interpretable results, as shown in the
figure below:

As a general rule of thumb, and an overly simplistic heuristic, the
goal is to look for an inflection point in the plot, which indicates a
suitable number of topics to select for K.
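The inflection-point heuristic can also be applied numerically to the data frame returned by FindTopicsNumber(), which has a topics column and one column per metric. The sketch below uses made-up metric values (they are not project results), and the gain threshold of 100 is an arbitrary assumption chosen for illustration.

```r
# Hypothetical metric values shaped like FindTopicsNumber() output;
# the numbers are made up for illustration and are not project results
toy_metrics <- data.frame(
  topics        = seq(10, 40, by = 5),
  Griffiths2004 = c(-5200, -4900, -4750, -4700, -4690, -4695, -4710)
)

# Candidate 1: the K with the highest Griffiths2004 value
best_k <- toy_metrics$topics[which.max(toy_metrics$Griffiths2004)]

# Candidate 2: the first K beyond which adding 5 more topics
# improves the metric by less than the chosen threshold
gains   <- diff(toy_metrics$Griffiths2004)
elbow_k <- toy_metrics$topics[which(gains < 100)[1]]

c(best_k = best_k, elbow_k = elbow_k)
```

With these toy values the maximum falls at K = 30 while the curve flattens after K = 20. A gap of this kind is a reminder that the plot supports the choice of K and does not settle it.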
The LDAvis Explorer
The LDAvis explorer was used to explore topic and word
distributions.
toLDAvis(mod = interviews_stm, docs = docs)
## Loading required namespace: servr
4. EXPLORE & MODEL
Silge and Robinson (2017) note that fitting a topic model is the
“easy part.” The hard part is making sense of the model results; the
rest of the analysis involves exploring and interpreting the model
using a variety of approaches, which are walked through in this
section.
Bail (2018) cautions, however, that:
…post-hoc interpretation of topic models is rather dangerous… and
can quickly come to resemble the process of “reading tea leaves,” or
finding meaning in patterns that are in fact quite arbitrary or even
random.
4a. Exploring Beta Values
The five most likely terms assigned to each topic were explored. These
per-topic-per-word probabilities, or β (“beta”) values, give the
probability of a term (word) being generated from a topic.
terms(interviews_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "people" "world" "wearing" "happened" "connections"
## [2,] "time" "lesson" "people" "culture" "teachers"
## [3,] "students" "garden" "traditional" "world" "traditions"
## [4,] "experience" "sustainable" "clothing" "understanding" "teaching"
## [5,] "world" "plan" "lederhosen" "learned" "program"
## Topic 6 Topic 7 Topic 8 Topic 9 Topic 10 Topic 11
## [1,] "people" "people" "time" "people" "experience" "people"
## [2,] "time" "time" "experience" "feel" "people" "experience"
## [3,] "picture" "world" "people" "experience" "time" "kids"
## [4,] "germany" "picture" "cool" "kids" "cool" "pictures"
## [5,] "kids" "guess" "read" "cool" "germany" "feel"
## Topic 12 Topic 13 Topic 14 Topic 15 Topic 16
## [1,] "feel" "cool" "similar" "cream" "time"
## [2,] "people" "world" "love" "picture" "experience"
## [3,] "experience" "favorite" "opportunities" "ice" "kids"
## [4,] "german" "padlet" "combination" "professional" "picture"
## [5,] "museum" "war" "learner" "museum" "thinking"
## Topic 17 Topic 18 Topic 19 Topic 20
## [1,] "people" "people" "people" "time"
## [2,] "time" "experience" "germany" "feel"
## [3,] "thinking" "picture" "guess" "cool"
## [4,] "trip" "kids" "talking" "students"
## [5,] "pictures" "world" "kids" "pictures"
Topic 10 (experience, people, time, cool, Germany) seems to be about
participants’ experience in Germany. This is consistent with the kind of
findings from the qualitative analysis of the same data.
tidy_lda <- tidy(interviews_lda)
tidy_lda
## # A tibble: 55,180 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 american 1.15e- 3
## 2 2 american 6.21e-42
## 3 3 american 4.83e-42
## 4 4 american 2.54e-37
## 5 5 american 8.77e- 3
## 6 6 american 7.09e- 4
## 7 7 american 9.35e- 5
## 8 8 american 2.40e- 4
## 9 9 american 2.08e- 3
## 10 10 american 1.10e- 3
## # … with 55,170 more rows
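The dimensions and values of the tibble above can be read with a little arithmetic. The sketch below only re-uses numbers printed in the output; the variable names are illustrative.

```r
# One row per topic-term pair: 20 topics times 2759 terms
n_topics <- 20
n_terms  <- 2759
n_topics * n_terms   # 55180, matching the tibble dimensions

# Beta values for "american" copied from the printed rows
beta_american <- c(topic_1 = 1.15e-3, topic_2 = 6.21e-42, topic_5 = 8.77e-3)

# The term is several times more probable under Topic 5 than under Topic 1,
# and effectively absent from Topic 2
ratio_5_vs_1 <- unname(beta_american["topic_5"] / beta_american["topic_1"])
round(ratio_5_vs_1, 1)
```

Comparing beta values across topics in this way helps identify which topics a term actually characterizes, beyond the top-five lists.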
top_terms <- tidy_lda %>%
group_by(topic) %>%
slice_max(beta, n = 5, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")

5. COMMUNICATE
As Silge and Robinson (2017) note, fitting the topic model is easy
relative to the more challenging work of interpreting the model
results.
Comparing results of thematic analysis conducted by researchers to
topic modeling conducted in R can provide insights into the limitations
and benefits of both forms of qualitative research. As Bail (2018)
warns, “…post-hoc interpretation of topic models is rather dangerous…
and can quickly come to resemble the process of ‘reading tea leaves,’ or
finding meaning in patterns that are in fact quite arbitrary or even
random.”
The same description could be applied to thematic analysis, a process
which is creative, time consuming, and can reflect the biases of the
researchers. By comparing the results of both approaches, researchers
and practitioners can gain a deeper understanding of the data and the
context in which it was collected. This can lead to transformative
insights into how qualitative research can be conducted in a more
rigorous and systematic manner, while still allowing for the creativity
and subjectivity that are inherent in this type of research. Moreover,
reflecting on the limitations and benefits of both thematic analysis and
topic modeling can help researchers and practitioners to identify areas
for improvement in their own research practices, as well as provide
guidance for future research studies. Ultimately, this can lead to more
robust and meaningful qualitative research that contributes to the
advancement of knowledge in a range of fields.
Initial findings from thematic analysis of focus group data indicated
that
1.) Self-selected digital projects allowed participants to develop
professional knowledge related to their specific
teaching content, curriculum, and classroom;
2.) International immersive experience with targeted professional
development provided an opportunity for reflections on
pandemic/post-pandemic instruction; and
3.) Teachers reflected that the experience had an impact on their
teaching dispositions and plans for the future.
Comparing topics to initial findings from thematic analysis suggests
that topic modeling reflects general, superficial interpretations of
the experience. Words such as “cool,” “experience,” “people,” and “feel”
reflect a positive but superficial experience in Germany. Thematic
analysis, conducted by researchers with personal relationships with
participants, suggests a more nuanced and sophisticated reading of the
data. That analysis reflects outcomes from the experience which suggest
deep change in teacher values and practices, a change that does not seem
to be suggested by topic modeling. While the text mining approach allows
for analysis of a large corpus of data, qualitative research from a team
offers a more in-depth analysis. The two have value together in that
topic modeling can be used with a larger corpus of data and can offer a
confirmation of the general direction of themes created in thematic
analysis. The use of topic modeling in addition to thematic analysis can
provide a form of triangulation valuable to the credibility of a
qualitative research study.
Further analysis could examine which words tend to follow others
immediately, or which words tend to co-occur within the same documents.
References
Bail, C. (2018). Strengths and weaknesses of text as data. Retrieved
from https://sicss.io/2019/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html
Braun, V., & Clarke, V. (2022). Conceptual and design thinking
for thematic analysis. Qualitative Psychology, 9(1), 3–26. https://doi.org/10.1037/qup0000196
Gillies, M., Murthy, D., Brenton, H., & Olaniyan, R. (2022).
Theme and topic: How qualitative research and topic modeling can be
brought together. arXiv preprint arXiv:2210.00707. https://doi.org/10.48550/arXiv.2210.00707
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy
approach. O’Reilly Media, Inc. Retrieved from https://www.tidytextmining.com/topicmodeling.html
Wickham, H., & Grolemund, G. (2017). R for Data Science: Import,
Tidy, Transform, Visualize, and Model Data. O’Reilly Media,
Inc. Retrieved from https://r4ds.had.co.nz/
