This project reports a methodological advancement on how topic modeling can be used to analyze qualitative data from transcripts of focus groups conducted with participants in an international professional development program. The program is designed to prepare inservice educators for culturally-responsive teaching (CRT) through practice with the development of technical representations of cultural themes in an international context. Themes developed using topic modeling can be compared to existing thematic analysis conducted by researchers on the same data. This will provide insight into the strengths and weaknesses of each approach and demonstrate the potential of topic modeling to enhance qualitative data analysis. This project offers a workflow for future researchers interested in the interaction between qualitative methodology and topic modeling.
Thematic analysis and topic modeling both aim to identify underlying themes in text data, but they differ in how they achieve this. One way of looking at these two forms of research is the age-old “man vs. machine” argument. In this construct, thematic analysis represents “man,” while topic modeling represents the “machine.” Topic modeling is an automated, algorithmic method that creates “topics” from a given text. “Topics” are groups of terms that frequently occur together and tend to be about the same subject. Thematic analysis is a process by which researchers create “themes” based on their own judgments. It is a highly interactive process involving subjective human interpretation. Both forms have strengths and weaknesses. This paper proposes that these forms of research are not opposing forces, as in “man vs. machine,” but rather can be used together. Specifically, qualitative research can be a workflow starting point, with topic modeling woven throughout as a means of strengthening findings.
While extensive research has been conducted on the uses of thematic analysis and topic modeling in separate research projects, less research examines their interaction and the potential for using both together. This project shares a study in which qualitative thematic analysis was used as a starting point and a topic model was then compared with its results. It also suggests future research approaches in which interactive machine learning is used to combine some of the benefits of both qualitative research and topic modeling.
Thematic analysis can provide nuanced analysis of complex phenomena (Blei, 2012). Researchers read the data as a whole to get an overall sense of what is being said and then apply “codes” to important passages. The codes are words or short phrases that identify the topic of the text. Researchers then review their initial codes to create higher-level themes (Braun & Clarke, 2022). However, it is a labor-intensive process requiring a very close reading of the data by a human researcher. The time and expertise required make it impossible to scale up to the volume of “big data.”
Topic modeling is a machine learning technique that automatically analyzes text data to identify clusters of words across a set of documents. This is known as “unsupervised” machine learning because it does not require a predefined list of tags or training data that has been previously classified by humans. Unlike thematic analysis, topic modeling is an automated method that is very well suited to big data, but it lacks the nuance of human interpretation. For example, Latent Dirichlet allocation (LDA), which is used in this study, is essentially a process of counting words and therefore does not take context into account as qualitative research does.
Context: Professional Development Abroad
The data are from an ongoing, international professional development program for inservice teachers. The program includes three Saturday classes in the spring, a two-week study abroad program in the summer at an international destination (in 2022, Munich, Germany), and one follow-up session in the fall. In this professional development abroad, teachers build a portfolio that represents a chosen cultural theme, including a lesson plan illustrating how they plan to apply technical cultural representation and/or analysis in their own classrooms. The program aims to provide teachers with experiences practicing with cultural frames and representational tools, so they can work with their own students to elicit and represent diverse cultural identities and perspectives.
Data Collection and Analysis
First, using qualitative methods, the research team analyzed data collected during focus groups, from digital portfolios, and from observations during the professional development program. The data consisted of researcher memos, projects and artifacts created by 19 teachers, and post-reflective focus groups held after the professional development experience. Braun and Clarke’s (2022) process of reflexive thematic analysis was applied to the focus group transcripts, drawing on participants’ portfolios and researcher observations.
Initial findings from thematic analysis of focus group data indicated that
This project will outline a second step in this process, in which the same data will be analyzed using topic modeling. This additional step in analyzing the data advances the field of qualitative research as it offers a means of reflecting on the comparison of the two sets of results.
With the increasing amount of qualitative data being generated in research studies, it is essential to have effective methods for analyzing and interpreting this information. The use of topic modeling in the analysis of textual data has gained popularity in recent years due to its ability to identify hidden themes and patterns within large datasets. This project describes a potential methodological advancement on how topic modeling can be used to analyze qualitative data.
Thus, the primary research question is:
What does topic modeling in the analysis of textual data from focus groups with inservice teachers participating in international professional development reveal about how topic modeling can enhance qualitative research data?
This study is designed to support researchers who generally utilize qualitative or quantitative methodologies exclusively but are interested in combining the two. Qualitative researchers who are interested in incorporating topic modeling will find these specific suggestions especially useful. This study will build on existing literature to offer a conceptual framework for the field to use. Findings will be used to provide suggestions for qualitative researchers interested in adding topic modeling to their research. Recommendations will be made about how topic modeling might be used in qualitative research in the future.
To analyze the textual data using topic modeling, the workflow of the process will include data wrangling, modeling, and exploration as follows:
The data were wrangled, a process that involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The text was preprocessed by converting it into a tidy text format, which included tokenization and removal of stop words. Stemming was considered but not applied.
The text was then analyzed by fitting a topic modeling algorithm using Latent Dirichlet Allocation (LDA). A number of topics was selected. This process was repeated until the appropriate number of topics was decided upon. Then beta values were created to explore the findings. While topic modeling involves many decisions and can be as much art as science (Bail, 2018), the purpose of topic modeling in this context was to develop a simple mathematical summary of the dataset which can help further explore trends and patterns in the data.
library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)
library(ggplot2)
library(dplyr)
After setting up the project in R, the data was imported into R.
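# Read the transcripts, forcing all three columns to character
CIDRE_interviews <- read_csv("data/Independent Analysis April 18 2023.csv",
  col_types = cols(
    Interview = col_character(),
    Interviewee = col_character(),
    Topic = col_character()
  )
)
# Split each transcript cell into one row per passage (passages are separated by three newlines)
CIDRE_interviews <- CIDRE_interviews %>%
  mutate(Interview = strsplit(as.character(Interview), "\n\n\n")) %>%
  unnest(Interview)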
CIDRE_interviews
## # A tibble: 333 × 3
## Interview Inter…¹ Topic
## <chr> <chr> <chr>
## 1 "Next is Chris and my picture. Let me make it bigger. Okay, th… Chris … <NA>
## 2 "making sure that she remembered in the photo that they were" Chris … <NA>
## 3 "…so this is the German theatre Museum. This one I went to on … Chris … <NA>
## 4 "Those are very grown up answers." Chris … <NA>
## 5 "We have to be grown up at some point.\n" Chris … <NA>
## 6 "\nThis is my photo of my personal experience. It was taken by… Kaitli… <NA>
## 7 "for the tape No it's a very it's a very like peaceful setting… Kaitli… <NA>
## 8 "Yeah, for the tape, it’s a swan, surrounded by its own feces … Kaitli… <NA>
## 9 "I feel badly that I already bailed on the public school syste… Karen … <NA>
## 10 "you did so much photographing. Do you have some favorite phot… Karen … <NA>
## # … with 323 more rows, and abbreviated variable name ¹Interviewee
The text was preprocessed, including converting the data into a tidy text format. This included data tokenization and removal of stop words.
First, the text was tokenized and stop words were removed. Additional stop words were added such as: like, just, well, yeah, lot, stuff, gonna.
interviews_tidy <- CIDRE_interviews %>%
  unnest_tokens(output = word, input = Interview) %>%
  anti_join(stop_words, by = "word") %>%
  # Drop additional filler words common in spoken transcripts
  filter(!word %in% c("lot", "stuff", "gonna", "like", "just", "well", "yeah"))
interviews_tidy
## # A tibble: 8,136 × 3
## Interviewee Topic word
## <chr> <chr> <chr>
## 1 Chris Alston <NA> chris
## 2 Chris Alston <NA> picture
## 3 Chris Alston <NA> bigger
## 4 Chris Alston <NA> picture
## 5 Chris Alston <NA> image
## 6 Chris Alston <NA> modern
## 7 Chris Alston <NA> art
## 8 Chris Alston <NA> museum
## 9 Chris Alston <NA> floor
## 10 Chris Alston <NA> chairs
## # … with 8,126 more rows
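The tokenization and stop word steps can be tried on a small example before running them on the full transcripts. The following is a minimal sketch, assuming a hypothetical two-row data frame (toy_interviews) in place of the study data and a hypothetical custom_stop_words tibble. It shows one way to keep the project-specific stop words in a reusable list and is not necessarily how this project handled them.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for the focus group transcripts (not the study data).
toy_interviews <- tibble(
  Interviewee = c("A", "B"),
  Interview = c(
    "Yeah, I just feel like the museum was cool.",
    "Well, the kids learned a lot of stuff in Germany."
  )
)

# Project-specific filler words, stored as a one-column tibble so they can be
# removed with the same anti_join() call used for the standard stop word list.
custom_stop_words <- tibble(
  word = c("like", "just", "well", "yeah", "lot", "stuff", "gonna")
)

toy_tidy <- toy_interviews %>%
  unnest_tokens(output = word, input = Interview) %>% # one lowercase token per row
  anti_join(stop_words, by = "word") %>%              # standard tidytext stop words
  anti_join(custom_stop_words, by = "word")           # project-specific additions

toy_tidy
```

Keeping the extra words in a tibble makes the list easy to extend as new filler words are noticed during reading.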
A word count to determine the most common words in the interviews was conducted. These word counts would be later used for creating a document term matrix for topic modeling.
interviews_tidy %>%
count(word, sort = TRUE)
## # A tibble: 2,759 × 2
## word n
## <chr> <int>
## 1 people 117
## 2 feel 68
## 3 time 68
## 4 experience 65
## 5 cool 45
## 6 world 45
## 7 kids 43
## 8 germany 42
## 9 picture 41
## 10 students 39
## # … with 2,749 more rows
The terms “people,” “feel,” “time,” and “experience” emerged as most common, reflecting that the study topic was an international learning experience. The terms “community,” “connections,” and “professional” are worth attention as well. The terms “white,” with 14 instances, and “black,” with 13 instances, suggest a potential theme of race that could be explored.
Each interview group was treated as a unique document, for a total of ten documents and 2,759 terms. Using the existing word counts, a matrix was created that contained a column for each word in the corpus and a value of n for how many times that word occurs in each document.
To create this document term matrix from the interview counts, the cast_dtm() function was used and assigned to the variable interviews_dtm.
interviews_dtm <- interviews_tidy %>%
  count(Topic, word) %>%
  cast_dtm(Topic, word, n)
# Inspect the class and summary of the resulting matrix
class(interviews_dtm)
interviews_dtm
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
## <<DocumentTermMatrix (documents: 10, terms: 2759)>>
## Non-/sparse entries: 3128/24462
## Sparsity : 89%
## Maximal term length: 17
## Weighting : term frequency (tf)
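To make the structure of a document-term matrix concrete, the sketch below builds one from a handful of tokens. The toy_tokens data frame and its group labels are hypothetical and stand in for the tokenized interviews; only count() and cast_dtm() mirror the calls used above.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tokens from two focus groups (not the study data).
toy_tokens <- tibble(
  group = c("g1", "g1", "g1", "g2", "g2"),
  word  = c("museum", "museum", "kids", "kids", "garden")
)

# Count each word within each group, then cast the counts into a
# document-term matrix: one row per group, one column per distinct word.
toy_dtm <- toy_tokens %>%
  count(group, word) %>%
  cast_dtm(document = group, term = word, value = n)

dim(toy_dtm) # number of documents, number of terms

# Dense view for inspection only; real corpora are kept sparse.
toy_counts <- as.matrix(toy_dtm)
toy_counts
```

Words that never occur in a group appear as zeros in the dense view, which is why the matrix for the full corpus is reported as highly sparse.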
Next, the original data set was prepared for structural topic modeling using the textProcessor() function, which removes punctuation and stop words to simplify results.
temp <- textProcessor(CIDRE_interviews$Interview,
  metadata = CIDRE_interviews,
  lowercase = TRUE,
  removestopwords = TRUE,
  removenumbers = TRUE,
  removepunctuation = TRUE,
  wordLengths = c(3, Inf),
  stem = TRUE,
  onlycharacter = FALSE,
  striphtml = TRUE,
  customstopwords = NULL
)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
Stemming was considered as a preprocessing step to reduce the size of the vocabulary and thus simplify the model. Stemming reduced the number of terms from 2,920 to 2,241. This reduction of 679 terms did not justify the risk of losing subtleties in meaning between words with similar stems, such as “support” and “supported,” that would be relevant to the findings of the study. Therefore, the text used for the topic model was not stemmed.
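The trade-off described here can be seen directly with the wordStem() function from the SnowballC package loaded earlier. This is a small illustrative sketch; the word list is hypothetical and chosen only to show how distinct forms collapse to a single stem.

```r
library(SnowballC)

# Hypothetical example words; not drawn from the study vocabulary counts.
words <- c("support", "supported", "supporting", "teacher", "teachers")

# Porter-style English stemming collapses inflected forms to a shared stem.
stems <- wordStem(words, language = "english")

data.frame(word = words, stem = stems)

# Vocabulary size before and after stemming.
length(unique(words))
length(unique(stems))
```

After stemming, "support" and "supported" can no longer be told apart, which is the loss of nuance weighed against the smaller vocabulary.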
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
The text was modeled to create a mathematical summary of the dataset. The resulting summaries can help trends and patterns in the data to surface, particularly when compared to thematic analysis.
Two steps were undertaken:
The LDA function, or Latent Dirichlet allocation, was used because it treats every document as a mixture of topics and every topic as a mixture of words. This means that a focus group interview could have an estimated topic proportion of 80% for Topic 1 but also be partly about Topic 2. Likewise, words can be shared between topics, so terms germane to the study such as “community” and “language” might carry similar weight in more than one topic.
A K value was selected. Choosing the number of topics for a model is an important research decision and can dramatically impact results. K was set to 20 as a starting point. The model was then refit with K = 15 and K = 10 for comparison before settling on a K value of 20.
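Refitting the model by hand at several values of K can be supplemented with the ldatuning package loaded earlier, which scores a range of candidate values. The sketch below is hedged in several ways: it uses the first 20 documents of the AssociatedPress matrix shipped with topicmodels as a stand-in for the study data, and the candidate values (2, 4, 6), the two metrics, and the Gibbs method are assumptions chosen for a quick illustration, not the settings used in this project.

```r
library(topicmodels)
library(ldatuning)

# Stand-in corpus (not the study data): a small slice of the
# AssociatedPress document-term matrix shipped with topicmodels.
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:20, ]

# Fit one LDA model per candidate K and compute two fit metrics.
# CaoJuan2009 is better when lower; Deveaud2014 is better when higher.
k_search <- FindTopicsNumber(
  dtm,
  topics  = c(2, 4, 6),
  metrics = c("CaoJuan2009", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 588),
  verbose = FALSE
)

k_search
```

Calling FindTopicsNumber_plot(k_search) draws the metrics against K. These scores are a guide rather than an answer, and the final choice still rests on whether the resulting topics are interpretable.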
interviews_lda <- LDA(interviews_dtm,
  k = 20,
  control = list(seed = 588)
)
interviews_lda
## A LDA_VEM topic model with 20 topics.
ap_topics <- tidy(interviews_lda, matrix = "beta")
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 6) %>%
  ungroup() %>%
  arrange(topic, -beta)
ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
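LDA does not rely on the Term Frequency-Inverse Document Frequency (TF-IDF) metric. It estimates the probability of each word within each topic directly from the raw term counts in the document-term matrix, here using the variational expectation-maximization (VEM) method reported in the model output above.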
Silge and Robinson (2017) note that fitting a topic model is the “easy part.” The hard part is making sense of the model results, and the rest of the analysis involves exploring and interpreting the model using a variety of approaches.
The five most likely terms assigned to each topic were explored. These per-topic-per-word probabilities, or β (“beta”) values, give the probability of a term being generated from a topic.
terms(interviews_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "people" "world" "wearing" "happened" "connections"
## [2,] "time" "lesson" "people" "culture" "teachers"
## [3,] "students" "garden" "traditional" "world" "traditions"
## [4,] "experience" "sustainable" "clothing" "understanding" "teaching"
## [5,] "world" "plan" "lederhosen" "learned" "program"
## Topic 6 Topic 7 Topic 8 Topic 9 Topic 10 Topic 11
## [1,] "people" "people" "time" "people" "experience" "people"
## [2,] "time" "time" "experience" "feel" "people" "experience"
## [3,] "picture" "world" "people" "experience" "time" "kids"
## [4,] "germany" "picture" "cool" "kids" "cool" "pictures"
## [5,] "kids" "guess" "read" "cool" "germany" "feel"
## Topic 12 Topic 13 Topic 14 Topic 15 Topic 16
## [1,] "feel" "cool" "similar" "cream" "time"
## [2,] "people" "world" "love" "picture" "experience"
## [3,] "experience" "favorite" "opportunities" "ice" "kids"
## [4,] "german" "padlet" "combination" "professional" "picture"
## [5,] "museum" "war" "learner" "museum" "thinking"
## Topic 17 Topic 18 Topic 19 Topic 20
## [1,] "people" "people" "people" "time"
## [2,] "time" "experience" "germany" "feel"
## [3,] "thinking" "picture" "guess" "cool"
## [4,] "trip" "kids" "talking" "students"
## [5,] "pictures" "world" "kids" "pictures"
tidy_lda <- tidy(interviews_lda)
tidy_lda
## # A tibble: 55,180 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 american 1.15e- 3
## 2 2 american 6.21e-42
## 3 3 american 4.83e-42
## 4 4 american 2.54e-37
## 5 5 american 8.77e- 3
## 6 6 american 7.09e- 4
## 7 7 american 9.35e- 5
## 8 8 american 2.40e- 4
## 9 9 american 2.08e- 3
## 10 10 american 1.10e- 3
## # … with 55,170 more rows
top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%
  arrange(desc(beta)) %>%
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 5 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")
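The beta values describe words within topics. The other half of the model, the mixture of topics within each document mentioned earlier, is held in the per-document-per-topic probabilities, often called gamma. The sketch below shows how these could be extracted with tidy(); it assumes a small two-topic model fitted to a slice of the AssociatedPress data from topicmodels as a stand-in, so the object names toy_lda and toy_gamma are hypothetical and the values are unrelated to the study.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Stand-in corpus and model (not the study data or the study model).
data("AssociatedPress", package = "topicmodels")
toy_lda <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 588))

# One row per document-topic pair; gamma is the estimated share of the
# document's words that were generated from that topic.
toy_gamma <- tidy(toy_lda, matrix = "gamma")

# The topic shares within each document sum to 1.
gamma_totals <- toy_gamma %>%
  group_by(document) %>%
  summarise(total = sum(gamma))

# The single most prominent topic for each document.
dominant_topic <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

dominant_topic
```

Applied to the interview model, the same call would show which focus groups are dominated by which topics, which may help when comparing topics against themes that were coded within particular groups.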
The topics reveal general elements related to travel, but do not reflect deeper, more nuanced interpretations provided in thematic analysis. For example, Topic 3: “wearing, people, traditional, clothing, lederhosen,” reflects a general take on the culture of Germany.
Thematic analysis, conducted by researchers with personal relationships with participants, suggests a more nuanced and sophisticated reading of the data, but the topics do imply that the themes are related. For example, one qualitative theme that emerged was that of professional knowledge to teachers’ specific teaching content, curriculum, and classroom. Topic 12: “feel, people, experience, german, museum” could reflect how teachers expect to bring their experience in Germany (specifically in museums) to their classroom. Topic 2: “world, lesson, garden, sustainable, plan” could refer to one student’s plan to create a sustainable garden with her students in her classroom.
Another theme from the thematic analysis was that international immersive experience with targeted professional development provided an opportunity for reflections on pandemic/post-pandemic instruction. This theme does not seem to be borne out in the topic modeling.
The final theme from the qualitative research was that teachers reflected that the experience had an impact on their teaching dispositions and plans for the future. Topic 14, for example, “similar, love, opportunities, combination, learner” could reflect how one teacher hoped to have similar travel experiences in the future because she loves learning.
Comparing topics to initial findings from thematic analysis suggests that topic modeling can be used to determine whether themes and topics generally align, and in this case they do.
The following is an example of a workflow drawn from qualitative research that integrates topic modeling with thematic analysis and can be used by future researchers.
First, after the initial qualitative familiarization phase of the data conducted with thematic analysis (Braun & Clarke, 2022), researchers can run a topic model to validate initial findings. For example, in this study, a topic model was run after the initial codes had been developed. Topics and themes generally corresponded. Qualitative themes reflected more nuanced reflections from program participants related to teacher burnout, but topic modeling resulted in topics related to “black” and “white” that suggest themes related to race that did not surface as themes in the qualitative data. While this data set was relatively small, topic modeling after initial coding could be particularly useful for a dataset that is too large to read.
It is important to note that the value of this approach is related to the quality of the topic model. In the early stages of qualitative research, this is particularly important when there is little coded data and the topics are not strong. As Bail (2018) warns, “…post-hoc interpretation of topic models is rather dangerous… and can quickly come to resemble the process of ‘reading tea leaves,’ or finding meaning in patterns that are in fact quite arbitrary or even random.”
One powerful way to integrate thematic analysis with topic modeling, which was not used in the present study but will be used with this data in the future, is to use the initial qualitative themes as a seed, as exemplified in Gillies et al. (2022). This is likely to improve the quality of the topics found by viewing them through the lens of human understanding of the data. A topic can be created for each theme discovered in qualitative analysis.
This interplay between qualitative themes and machine learned topics would make it possible to detect and correct problems as they occur. Instead of building a topic model at the beginning or end of the qualitative analysis, researchers build a topic model throughout the steps of coding the qualitative data. The topic model would be continuously compared to the researchers’ codes to consider suggestions for new documents. In other words, the topic model is informed by the thematic analysis.
Combining the present study (using topic modeling after developing initial thematic codes) with Gillies et al.’s approach (which integrates topic modeling along the workflow of thematic analysis) reveals an interactive approach that could yield stronger research results for future studies.
This interactive model leverages inherent weaknesses in both approaches. For example, while machine learning results can often be unexpected, this unpredictability can be beneficial. With this interactive approach, since errors in the topic model can be corrected by adding new codes, the researcher has greater control over the topic modeling. And while qualitative themes are chosen by subjective researchers, topic modeling can serve as another form of triangulation of the data. This “dance” of themes and topics also enables researchers to incorporate how themes evolve over time as part of their research findings. Taking into account the machine’s interpretation of a theme challenges researchers’ evolving human understanding of a research question.
This type of work is particularly challenging as it weaves together two research methodologies, qualitative and quantitative, which have very different philosophies and can sometimes appear to be in opposition. Practitioners from both methodologies may be reluctant to combine the forms.
Quantitative researchers may feel that thematic analysis is too subjective to combine with topic modeling. Qualitative researchers may be reluctant to utilize quantitative research methodologies, such as topic modeling, in their research. They might argue that it oversimplifies the complex human phenomenon being researched. In addition, qualitative researchers may lack literacy in machine learning and be reluctant to learn. At the present time, it may be challenging to find a qualitative research team with an understanding of topic modeling and its strengths and weaknesses.
Another limitation of this research is that the statistical model may be leaned on too heavily by researchers. As Blei explains, researchers should have a critical view of topics. While statistical models can help interpret and understand texts, it is still the scholar’s job to do the actual interpreting and understanding. He writes, “A model of texts, built with a particular theory in mind, cannot provide evidence for the theory” (2012, para 8).
This paper has provided an example of a blending of thematic qualitative research and topic modeling. This approach has the potential to initiate new forms of research that combine the benefits of human interpretation with those of automated processing.
Comparing results of thematic analysis conducted by researchers to topic modeling conducted in R can provide insights into the limitations and benefits of both approaches.
The same description could be applied to thematic analysis, a process which is creative, time consuming, and can reflect the biases of the researchers. By comparing the results of both approaches, researchers and practitioners can gain a deeper understanding of the data and the context in which it was collected. This can lead to transformative insights into how qualitative research can be conducted in a more rigorous and systematic manner, while still allowing for the creativity and subjectivity that are inherent in this type of research. Moreover, reflecting on the limitations and benefits of both thematic analysis and topic modeling can help researchers and practitioners to identify areas for improvement in their own research practices, as well as provide guidance for future research studies. Ultimately, this can lead to more robust and meaningful qualitative research that contributes to the advancement of knowledge in a range of fields.
Braun, V., & Clarke, V. (2022). Conceptual and design thinking for thematic analysis. Qualitative Psychology, 9(1), 3–26. https://doi.org/10.1037/qup0000196
Bail, C. (2018). Strengths and weaknesses of text as data. Retrieved from https://sicss.io/2019/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html
Blei, D. M. (2012). Topic modeling and digital humanities. Journal of Digital Humanities, 2(1). Retrieved from https://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/
Gillies, M., Murthy, D., Brenton, H., & Olaniyan, R. (2022). Theme and topic: How qualitative research and topic modeling can be brought together. arXiv preprint arXiv:2210.00707. https://doi.org/10.48550/arXiv.2210.00707
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media, Inc. Retrieved from: https://www.tidytextmining.com/topicmodeling.html
Wickham, H. & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc. Retrieved from https://r4ds.had.co.nz/