Abstract

This project reports a methodological advancement on how topic modeling can be used to analyze qualitative data from transcripts of focus groups conducted with participants in an international professional development program. The program is designed to prepare inservice educators for culturally-responsive teaching (CRT) through practice with the development of technical representations of cultural themes in an international context. Themes developed using topic modeling can be compared to existing thematic analysis conducted by researchers on the same data. This will provide insight into the strengths and weaknesses of each approach and demonstrate the potential of topic modeling to enhance qualitative data analysis. This will be useful for future researchers interested in the interaction between qualitative methodology and topic modeling.


1. PREPARE

1a. Data sources used for text analysis

Context: Professional Development Abroad

The data are from an ongoing, international professional development program for inservice teachers. It includes three Saturday classes in the spring, a two-week study abroad program in the summer at an international destination (in 2022, Munich, Germany), and one follow-up session in the fall. In this professional development abroad, teachers build a portfolio that represents a chosen cultural theme, including a lesson plan illustrating how they plan to apply technical cultural representation and/or analysis in their own classrooms. The program aims to provide teachers with experiences practicing with cultural frames and representational tools, so they can work with their own students to elicit and represent diverse cultural identities and perspectives.

Data Collection and Analysis

First, using qualitative methods, the research team analyzed data collected during focus groups, from digital portfolios, and from observations during the professional development program. Data collected consisted of researcher memos, projects and artifacts created by 19 teachers, and post-reflective focus groups after the professional development experience. Using the transcripts from the focus groups, participants’ portfolios, and researcher observations, the team applied Braun and Clarke’s (2022) process of reflexive thematic analysis to the focus group data.

Initial findings from thematic analysis of focus group data indicated that

1.) Self-selected digital projects allowed participants to develop professional knowledge related to their specific teaching content, curriculum, and classroom;

2.) International immersive experience with targeted professional development provided an opportunity for reflections on pandemic/post-pandemic instruction; and

3.) Teachers reflected that the experience had an impact on their teaching dispositions and plans for the future.

This project will outline a second step in this process, in which the same data will be analyzed using topic modeling. Topic modeling is a machine learning technique that automatically analyzes text data to identify clusters of co-occurring words across a set of documents. This is known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans. This additional step in analyzing the data advances the field of qualitative research as it offers a means of reflecting on the comparison of the two sets of results.

1b. Guiding Question

With the increasing amount of qualitative data being generated in research studies, it is essential to have effective methods for analyzing and interpreting this information. The use of topic modeling in the analysis of textual data has gained popularity in recent years due to its ability to identify hidden themes and patterns within large datasets. This project describes a potential methodological advancement on how topic modeling can be used to analyze qualitative data.

Thus, the primary research question is:

What does topic modeling in the analysis of textual data from focus groups with inservice teachers participating in international professional development reveal about how topic modeling can enhance qualitative research data?

2. WRANGLE

To analyze the textual data using topic modeling, the workflow of the process will include data wrangling, modeling, and exploration as follows: 

The data were wrangled, a process which involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The text was preprocessed by converting the data into a tidy text format, which included tokenization and removal of stop words. Stemming was evaluated as an additional preprocessing step.

The text was then analyzed by fitting a topic model using Latent Dirichlet Allocation (LDA). A number of topics was selected, and beta values were extracted to explore the resulting topics. This process was repeated until an appropriate number of topics was decided upon. While topic modeling involves many decisions and can be as much art as science (Bail, 2018), the purpose of topic modeling in this context was to develop a simple mathematical summary of the dataset which can help further explore trends and patterns in the data.

2a. Project Set Up and Import Interview Data

library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)
library(ggplot2)
library(dplyr)

After the project was set up, the interview data were imported into R.

CIDRE_interviews <- read_csv("data/Independent Analysis April 18 2023.csv", 
     col_types = cols(
                   Interview = col_character(), 
                   Interviewee = col_character(), 
                   Topic = col_character()
                   )
    )

CIDRE_interviews <- CIDRE_interviews %>% 
    mutate(Interview = strsplit(as.character(Interview), "\n\n\n")) %>% 
    unnest(Interview)

CIDRE_interviews
## # A tibble: 333 × 3
##    Interview                                                       Inter…¹ Topic
##    <chr>                                                           <chr>   <chr>
##  1 "Next is Chris and my picture. Let me make it bigger. Okay, th… Chris … <NA> 
##  2 "making sure that she remembered in the photo that they were"   Chris … <NA> 
##  3 "…so this is the German theatre Museum. This one I went to on … Chris … <NA> 
##  4 "Those are very grown up answers."                              Chris … <NA> 
##  5 "We have to be grown up at some point.\n"                       Chris … <NA> 
##  6 "\nThis is my photo of my personal experience. It was taken by… Kaitli… <NA> 
##  7 "for the tape No it's a very it's a very like peaceful setting… Kaitli… <NA> 
##  8 "Yeah, for the tape, it’s a swan, surrounded by its own feces … Kaitli… <NA> 
##  9 "I feel badly that I already bailed on the public school syste… Karen … <NA> 
## 10 "you did so much photographing. Do you have some favorite phot… Karen … <NA> 
## # … with 323 more rows, and abbreviated variable name ¹​Interviewee

2b. Cast a Document Term Matrix

The text was preprocessed by converting the data into a tidy text format: the text was tokenized and stop words were removed. The following additional stop words were also removed: like, just, well, yeah, lot, stuff, gonna.

interviews_tidy <- CIDRE_interviews %>% 
  unnest_tokens(output = word, input = Interview) %>%
  anti_join(stop_words, by = "word") %>%
  # Drop the additional filler words (custom stop words) listed above
  filter(!word %in% c("lot", "stuff", "gonna", "like", "just", "well", "yeah"))



interviews_tidy
## # A tibble: 8,136 × 3
##    Interviewee  Topic word   
##    <chr>        <chr> <chr>  
##  1 Chris Alston <NA>  chris  
##  2 Chris Alston <NA>  picture
##  3 Chris Alston <NA>  bigger 
##  4 Chris Alston <NA>  picture
##  5 Chris Alston <NA>  image  
##  6 Chris Alston <NA>  modern 
##  7 Chris Alston <NA>  art    
##  8 Chris Alston <NA>  museum 
##  9 Chris Alston <NA>  floor  
## 10 Chris Alston <NA>  chairs 
## # … with 8,126 more rows

A word count was conducted to determine the most common words in the interviews. These word counts were later used to create a document term matrix for topic modeling.

interviews_tidy %>%
  count(word, sort = TRUE)
## # A tibble: 2,759 × 2
##    word           n
##    <chr>      <int>
##  1 people       117
##  2 feel          68
##  3 time          68
##  4 experience    65
##  5 cool          45
##  6 world         45
##  7 kids          43
##  8 germany       42
##  9 picture       41
## 10 students      39
## # … with 2,749 more rows

The terms “people,” “feel,” “time,” and “experience” emerged as most common, reflecting that the study topic was an international learning experience. The terms “community,” “connections,” and “professional” are worth attention as well. The terms “white,” with 14 instances, and “black,” with 13 instances, suggest the potential to explore a theme of race.

Creating a Document Term Matrix

Each interview group was treated as a unique document, for a total of ten documents containing 2759 terms. Using the existing word counts, a matrix was created that contains a column for each word in the corpus and a value of n for how many times that word occurs in each document.

To create this document term matrix from the interview counts, the cast_dtm() function was used and assigned to the variable interviews_dtm.

interviews_dtm <- interviews_tidy %>%
  count(Topic, word) %>%
  cast_dtm(Topic, word, n)
## [1] "DocumentTermMatrix"    "simple_triplet_matrix"
## <<DocumentTermMatrix (documents: 10, terms: 2759)>>
## Non-/sparse entries: 3128/24462
## Sparsity           : 89%
## Maximal term length: 17
## Weighting          : term frequency (tf)

2c. Preprocessing and (not) Stemming

Next, the original data set was prepared for structural topic modeling using the textProcessor() function, which removes punctuation and stop words to simplify results.

temp <- textProcessor(CIDRE_interviews$Interview, 
                    metadata = CIDRE_interviews,  
                    lowercase=TRUE, 
                    removestopwords=TRUE, 
                    removenumbers=TRUE,  
                    removepunctuation=TRUE, 
                    wordLengths=c(3,Inf),
                    stem=FALSE,  # stemming disabled; see the rationale below
                    onlycharacter= FALSE, 
                    striphtml=TRUE, 
                    customstopwords=NULL)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Creating Output...

Stemming was considered as a preprocessing step to reduce the size of the vocabulary in natural language and thus simplify the model. Stemming reduced the number of terms from 2920 to 2241. This reduction in corpus size of 679 terms does not justify the risk of losing subtleties in meaning between words with similar word stems, such as “support” and “supported” that would be relevant to the findings of the study. Therefore, this project does not use stemming.
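The trade-off described above can be illustrated with a minimal sketch. It assumes the SnowballC package loaded during project setup and uses a handful of illustrative words that are not drawn from the interview data. The point is only to show how distinct word forms collapse to one stem, which is the loss of nuance the project chose to avoid.

```r
library(SnowballC)

# Illustrative words only; these are not taken from the transcripts.
words <- c("support", "supported", "supporting", "teach", "teaching")

# wordStem() applies the Porter stemmer by default (language = "porter").
stems <- wordStem(words)

# Side-by-side view: several word forms map to the same stem.
data.frame(word = words, stem = stems)

# Vocabulary size before and after stemming for this toy example.
c(before = length(unique(words)), after = length(unique(stems)))
```

In this toy case the tense and aspect distinctions between forms disappear after stemming, which is why a modest reduction in vocabulary size was not considered worth the cost here.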

meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents

3. MODEL

The text was modeled to create a mathematical summary of the dataset. The resulting summaries can help trends and patterns in the data to surface, particularly when compared to thematic analysis. Topic Modeling is an unsupervised learning approach that can provide insight into the structure of the dataset.

Three steps were undertaken:

  1. Fitting a Topic Model with LDA. The topicmodels package and its LDA() function for unsupervised classification were used to find natural groupings of words, or topics, in the interview data.
  2. Fitting a Structural Topic Model. The stm package and stm() function were used to fit the model, using metadata about documents to improve the assignment of words to “topics” in the corpus.
  3. Choosing K. Finally, an appropriate number of topics was selected.

3a. Fitting a Topic Model with LDA

Latent Dirichlet allocation (LDA), fitted with the LDA() function, was used because it treats every document as a mixture of topics and every topic as a mixture of words. This means that a focus group interview could have an estimated topic proportion of 80% for Topic 1 but also be partly about Topic 2. Likewise, words can be shared between topics, so terms germane to the study such as “community” and “language” might appear with similar probability in more than one topic.
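The idea that each document is a mixture of topics can be sketched on a toy corpus. The example below is an illustration under stated assumptions, not the project's analysis: the three documents and their word counts are hypothetical, and k = 2 is chosen only to keep the output small. It uses the "gamma" matrix, which is the name tidytext gives to the per-document topic proportions of a fitted LDA model.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy corpus: word counts for three short "documents".
toy_counts <- tibble(
  doc  = c("d1", "d1", "d2", "d2", "d3", "d3"),
  word = c("museum", "art", "students", "lesson", "museum", "students"),
  n    = c(4L, 3L, 4L, 3L, 2L, 2L)
)

# Cast to a document-term matrix of raw counts, as LDA() expects.
toy_dtm <- cast_dtm(toy_counts, doc, word, n)

# Fit a two-topic model; the seed makes the run reproducible.
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 588))

# gamma holds one row per document-topic pair.
toy_gamma <- tidy(toy_lda, matrix = "gamma")

# Each document's topic proportions sum to 1, i.e. it is a mixture.
toy_totals <- toy_gamma %>%
  group_by(document) %>%
  summarise(total = sum(gamma))
toy_totals
```

The same call, tidy(interviews_lda, matrix = "gamma"), could be applied to the fitted interview model to see how strongly each focus group loads on each topic.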

LDA requires a k value to be specified for the number of topics in the focus group interviews. K was selected as 20 as a starting point. K was then run as 15 by way of comparison.

interviews_lda <- LDA(interviews_dtm, 
                  k = 20,
                  control = list(seed = 588)
                  )

interviews_lda
## A LDA_VEM topic model with 20 topics.
ap_topics <- tidy(interviews_lda, matrix = "beta")
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 6) %>% 
  ungroup() %>%
  arrange(topic, -beta)

ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

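LDA does not rely on TF-IDF weighting. The document term matrix passed to LDA() holds raw term counts (the “term frequency (tf)” weighting reported above), and from those counts the model estimates a probability for each term within each topic (beta) and a proportion for each topic within each document (gamma). The plot above shows the six highest-probability terms for each topic.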

3b. Fitting a Structural Topic Model

Bail (2018) argues that one reason STM has risen in popularity and use is that it employs metadata about documents to improve the assignment of words to topics in a corpus, and that this metadata can be used to examine relationships between covariates and documents. This was useful for this project in confirming that the topics related to themes revealed in the qualitative analysis.

The stm Package

Before fitting an STM model, it was necessary to extract the following elements:

docs <- temp$documents 
meta <- temp$meta 
vocab <- temp$vocab 

These elements were then used to fit the model using the same value of K that was specified for the LDA topic model.

interviews_stm <- stm(documents=docs, 
         data=meta,
         vocab=vocab, 
         K=20,
         max.em.its=25,
         verbose = FALSE)

interviews_stm
## A topic model with 20 topics, 308 documents and a 2329 word dictionary.

The function allows for viewing the most probable words assigned to each topic.

plot.STM(interviews_stm, n = 5)

plot(interviews_stm, n = 5)

3c. Finding K

As alluded to earlier, selecting the number of topics for a model is a non-trivial decision that can dramatically affect the results. Bail (2018) notes that:

The results of topic models should not be over-interpreted unless the researcher has strong theoretical apriori about the number of topics in a given corpus, or if the researcher has carefully validated the results of a topic model using both the quantitative and qualitative techniques described above.

The FindTopicsNumber Function

The ldatuning package was used to assist with finding a value for K.

k_metrics <- FindTopicsNumber(
  interviews_dtm,
  topics = seq(10, 75, by = 5),
  metrics = "Griffiths2004",
  method = "Gibbs",
  control = list(),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)

FindTopicsNumber_plot(k_metrics)

Note that the FindTopicsNumber() function offers three additional metrics (CaoJuan2009, Arun2010, and Deveaud2014) that can be used to estimate the most preferable number of topics for an LDA model. The Griffiths2004 metric, which is included in the package's default example, was used here because it produced the most interpretable results in the plot generated by FindTopicsNumber_plot().

As a general rule of thumb and overly simplistic heuristic, the goal is to look for an inflection point in the plot, which indicates a candidate number of topics to select as the value of K.
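The inflection-point heuristic can also be approximated numerically. The sketch below is one crude way to do so in base R, under loud assumptions: the metric values are made up for illustration, the column names mirror the data frame that FindTopicsNumber() returns, and the 10% threshold is an arbitrary choice rather than a recommendation from the ldatuning package.

```r
# Hypothetical metric values for illustration only; in practice these would
# come from the data frame returned by FindTopicsNumber().
toy_metrics <- data.frame(
  topics        = seq(10, 50, by = 5),
  Griffiths2004 = c(-5200, -4700, -4400, -4250, -4200,
                    -4180, -4175, -4172, -4170)
)

# Improvement in the metric from each candidate K to the next one.
gain <- diff(toy_metrics$Griffiths2004)

# Crude elbow rule: the first K after which the improvement falls below
# 10% of the largest improvement. The 10% cutoff is an assumption.
elbow_index <- which(gain < 0.1 * max(gain))[1]
elbow_k <- toy_metrics$topics[elbow_index]
elbow_k  # 30 for these made-up values
```

A rule like this only flags a candidate; the value of K still needs to be checked against the interpretability of the resulting topics.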

The LDAvis Explorer

The LDAvis explorer was used to explore topic and word distributions.

toLDAvis(mod = interviews_stm, docs = docs)
## Loading required namespace: servr

4. EXPLORE & MODEL

Silge and Robinson (2017) note that fitting a topic model is the “easy part.” The hard part is making sense of the model results; the rest of the analysis involves exploring and interpreting the model using a variety of approaches, which are walked through in this section.

Bail (2018) cautions, however, that:

…post-hoc interpretation of topic models is rather dangerous… and can quickly come to resemble the process of “reading tea leaves,” or finding meaning in patterns that are in fact quite arbitrary or even random.

4a. Exploring Beta Values

The five most likely terms assigned to each topic were explored. These per-topic-per-word probabilities, or β (“beta”) values, give the probability of a term (word) being generated from a given topic.

terms(interviews_lda, 5)
##      Topic 1      Topic 2       Topic 3       Topic 4         Topic 5      
## [1,] "people"     "world"       "wearing"     "happened"      "connections"
## [2,] "time"       "lesson"      "people"      "culture"       "teachers"   
## [3,] "students"   "garden"      "traditional" "world"         "traditions" 
## [4,] "experience" "sustainable" "clothing"    "understanding" "teaching"   
## [5,] "world"      "plan"        "lederhosen"  "learned"       "program"    
##      Topic 6   Topic 7   Topic 8      Topic 9      Topic 10     Topic 11    
## [1,] "people"  "people"  "time"       "people"     "experience" "people"    
## [2,] "time"    "time"    "experience" "feel"       "people"     "experience"
## [3,] "picture" "world"   "people"     "experience" "time"       "kids"      
## [4,] "germany" "picture" "cool"       "kids"       "cool"       "pictures"  
## [5,] "kids"    "guess"   "read"       "cool"       "germany"    "feel"      
##      Topic 12     Topic 13   Topic 14        Topic 15       Topic 16    
## [1,] "feel"       "cool"     "similar"       "cream"        "time"      
## [2,] "people"     "world"    "love"          "picture"      "experience"
## [3,] "experience" "favorite" "opportunities" "ice"          "kids"      
## [4,] "german"     "padlet"   "combination"   "professional" "picture"   
## [5,] "museum"     "war"      "learner"       "museum"       "thinking"  
##      Topic 17   Topic 18     Topic 19  Topic 20  
## [1,] "people"   "people"     "people"  "time"    
## [2,] "time"     "experience" "germany" "feel"    
## [3,] "thinking" "picture"    "guess"   "cool"    
## [4,] "trip"     "kids"       "talking" "students"
## [5,] "pictures" "world"      "kids"    "pictures"

Topic 10 (experience, people, time, cool, germany) seems to be about participants’ experience in Germany. This corresponds with the kind of findings produced by the qualitative analysis of the same data.

tidy_lda <- tidy(interviews_lda)

tidy_lda
## # A tibble: 55,180 × 3
##    topic term         beta
##    <int> <chr>       <dbl>
##  1     1 american 1.15e- 3
##  2     2 american 6.21e-42
##  3     3 american 4.83e-42
##  4     4 american 2.54e-37
##  5     5 american 8.77e- 3
##  6     6 american 7.09e- 4
##  7     7 american 9.35e- 5
##  8     8 american 2.40e- 4
##  9     9 american 2.08e- 3
## 10    10 american 1.10e- 3
## # … with 55,170 more rows
top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 5 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

5. COMMUNICATE

As Silge and Robinson (2017) note, fitting the topic model is easy relative to the more challenging work of interpreting the model results.

Comparing results of thematic analysis conducted by researchers to topic modeling conducted in R can provide insights into the limitations and benefits of both forms of qualitative research. As Bail (2018) warns, “…post-hoc interpretation of topic models is rather dangerous… and can quickly come to resemble the process of ‘reading tea leaves,’ or finding meaning in patterns that are in fact quite arbitrary or even random.”

The same description could be applied to thematic analysis, a process which is creative, time consuming, and can reflect the biases of the researchers. By comparing the results of both approaches, researchers and practitioners can gain a deeper understanding of the data and the context in which it was collected. This can lead to transformative insights into how qualitative research can be conducted in a more rigorous and systematic manner, while still allowing for the creativity and subjectivity that are inherent in this type of research. Moreover, reflecting on the limitations and benefits of both thematic analysis and topic modeling can help researchers and practitioners to identify areas for improvement in their own research practices, as well as provide guidance for future research studies. Ultimately, this can lead to more robust and meaningful qualitative research that contributes to the advancement of knowledge in a range of fields.

Initial findings from thematic analysis of focus group data indicated that

1.) Self-selected digital projects allowed participants to develop professional knowledge related to their specific teaching content, curriculum, and classroom;

2.) International immersive experience with targeted professional development provided an opportunity for reflections on pandemic/post-pandemic instruction; and

3.) Teachers reflected that the experience had an impact on their teaching dispositions and plans for the future.

Comparing topics to initial findings from thematic analysis suggests that topic modeling reflects general, superficial interpretations of the experience. Words such as “cool,” “experience,” “people,” and “feel” reflect a positive but superficial experience in Germany. Thematic analysis, conducted by researchers with personal relationships with participants, suggests a more nuanced, sophisticated reading of the data. This analysis reflects outcomes from the experience which suggest deep change in teacher values and practices, changes that do not seem to be suggested by topic modeling. While the text mining approach allows for analysis of a large corpus of data, qualitative research by a team offers a more in-depth analysis. The two have value together in that topic modeling can be used with a larger corpus of data and can offer a confirmation of the general direction of themes created in thematic analysis. The use of topic modeling in addition to thematic analysis can provide a form of triangulation valuable to the credibility of a qualitative research study.

References

Braun, V., & Clarke, V. (2022). Conceptual and design thinking for thematic analysis. Qualitative Psychology, 9(1), 3–26. https://doi.org/10.1037/qup0000196

Bail, C. (2018). Strengths and weaknesses of text as data. Retrieved from https://sicss.io/2019/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html

Gillies, M., Murthy, D., Brenton, H., & Olaniyan, R. (2022). Theme and topic: How qualitative research and topic modeling can be brought together. arXiv preprint arXiv:2210.00707. https://doi.org/10.48550/arXiv.2210.00707

Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O’Reilly Media, Inc. Retrieved from: https://www.tidytextmining.com/topicmodeling.html

Wickham, H. & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc. Retrieved from https://r4ds.had.co.nz/


---
title: "Using R to Analyze Qualitative Data from Teachers’ International Professional Development Experience"
output: 
  html_document:
    toc: true
    toc_depth: 3
    toc_float: yes
    code_folding: hide
    code_download: TRUE
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

### Abstract

This project reports a methodological advancement on how topic modeling
can be used to analyze qualitative data from transcripts of focus groups
conducted with participants in an international professional development
program. The program is designed to prepare inservice educators for culturally-responsive
teaching (CRT) through practice with the development of technical
representations of cultural themes in an international context. Themes
developed using topic modeling can be compared to existing thematic
analysis conducted by researchers on the same data. This will provide
insight into the strengths and weaknesses of each approach and
demonstrate the potential of topic modeling to enhance qualitative data
analysis. This will be useful for future researchers interested in the
interaction between qualitative methodology and topic modeling.

------------------------------------------------------------------------

## 1. PREPARE

### 1a. **Data sources used for text analysis**

**Context: Professional Development Abroad**

The data are from an ongoing, international, professional development
program for inservice teachers. It includes three Saturday classes in
the spring, two weeks of a study abroad program in the summer to an
international destination (in 2022, to Munich, Germany) and one follow-up
session in the fall. In this professional development abroad, teachers
build a portfolio that represents a chosen cultural theme, including a
lesson plan illustrating how they plan to apply technical cultural
representation and/or analysis in their own classrooms. The program aims
to provide teachers with experiences practicing with cultural frames and
representational tools, so they can work with their own students to
elicit and represent diverse cultural identities and perspectives.

**Data Collection and Analysis**

First, using qualitative methods, the research team analyzed data
collected during focus groups, from digital portfolios, and from
observations during the professional development program. Data collected
consisted of researcher memos, projects and artifacts created by 19
teachers, and post-reflective focus groups after the professional
development experience. Using the transcripts from the focus groups,
participants' portfolios, and researcher observations, the team applied
Braun and Clarke's (2022) process of reflexive thematic analysis to the
focus group data.

Initial findings from thematic analysis of focus group data indicated
that

1.) Self-selected digital projects allowed participants to develop
**professional knowledge** related to their specific teaching content,
curriculum, and classroom;

2.) International immersive experience with targeted professional
development provided an opportunity for **reflections on
pandemic/post-pandemic instruction**; and

3.) Teachers reflected that the experience had an impact on their
**teaching dispositions and plans for the future.**

This project will outline a second step in this process, in which the
same data will be analyzed using topic modeling. Topic modeling is a
machine learning technique that automatically analyzes text data to
identify clusters of co-occurring words across a set of documents. This
is known as
'unsupervised' machine learning because it doesn't require a predefined
list of tags or training data that's been previously classified by
humans. This additional step in analyzing the data advances the field of
qualitative research as it offers a means of reflecting on the
comparison of the two sets of results.

### 1b. Guiding Question

With the increasing amount of qualitative data being generated in
research studies, it is essential to have effective methods for
analyzing and interpreting this information. The use of topic modeling
in the analysis of textual data has gained popularity in recent years
due to its ability to identify hidden themes and patterns within large
datasets. This project describes a potential methodological advancement
on how topic modeling can be used to analyze qualitative data.

Thus, the primary research question is:

**What does topic modeling in the analysis of textual data from focus
groups with inservice teachers participating in international
professional development reveal about how topic modeling can enhance
qualitative research data?**

## 2. WRANGLE

To analyze the textual data using topic modeling, the workflow of the
process will include data wrangling, modeling, and exploration as
follows: 

The data were wrangled, a process which involves some combination of
cleaning, reshaping, transforming, and merging data (Wickham &
Grolemund, 2017). The text was preprocessed by converting the data into
a tidy text format, which included tokenization and removal of stop
words. Stemming was evaluated as an additional preprocessing step.

The text was then analyzed by fitting a topic model using Latent
Dirichlet Allocation (LDA). A number of topics was selected, and beta
values were extracted to explore the resulting topics. This process was
repeated until an appropriate number of topics was decided upon. While
topic modeling involves many decisions and can be as much art as science
(Bail, 2018), the purpose of topic modeling in this context was to
develop a simple mathematical summary of the dataset which can help
further explore trends and patterns in the data.

### 2a. Project Set Up and Import Interview Data

```{r load-packages, message=FALSE}
library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(LDAvis)
library(ggplot2)
library(dplyr)
```

After the project was set up, the interview data were imported into R.

```{r read-csv}
CIDRE_interviews <- read_csv("data/Independent Analysis April 18 2023.csv", 
     col_types = cols(
                   Interview = col_character(), 
                   Interviewee = col_character(), 
                   Topic = col_character()
                   )
    )

CIDRE_interviews <- CIDRE_interviews %>% 
    mutate(Interview = strsplit(as.character(Interview), "\n\n\n")) %>% 
    unnest(Interview)

CIDRE_interviews
```

### 2b. Cast a Document Term Matrix

The text was preprocessed by converting the data into a tidy text
format: the text was tokenized and stop words were removed. The
following additional stop words were also removed: like, just, well,
yeah, lot, stuff, gonna.

```{r tokenize-interviews}
interviews_tidy <- CIDRE_interviews %>% 
  unnest_tokens(output = word, input = Interview) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c("lot", "stuff", "gonna", "like", "just", "well", "yeah"))



interviews_tidy
```
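
One alternative way to manage project-specific stop words is to keep them in their own small table and remove them with a second `anti_join()`. The sketch below assumes a hypothetical one-sentence transcript and a hypothetical object name, `custom_stop_words`; it is meant only to show the pattern, not to replace the chunk above.

```r
library(dplyr)
library(tidytext)

# Project-specific filler words, kept in one place so the list is easy to extend
custom_stop_words <- tibble(
  word = c("like", "just", "well", "yeah", "lot", "stuff", "gonna")
)

# Hypothetical one-sentence transcript for illustration
toy <- tibble(
  Topic = "Group A",
  Interview = "Yeah, it was just a lot of community building, like, every day."
)

toy_tidy <- toy %>%
  unnest_tokens(output = word, input = Interview) %>%
  anti_join(stop_words, by = "word") %>%         # standard tidytext stop words
  anti_join(custom_stop_words, by = "word")      # project-specific additions

toy_tidy
```

Keeping the custom list in a table makes it straightforward to report the exact words removed alongside the results.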

A word count was conducted to determine the most common words in the
interviews. These word counts were later used to create a document
term matrix for topic modeling.

```{r count-words}
interviews_tidy %>%
  count(word, sort = TRUE)
```

The terms "people," "feel," "time," and "experience" emerged as most
common, reflecting that the study topic was an international learning
experience. The terms "community," "connections," and "professional" are
worth attention as well. The terms "white," with 14 instances, and
"black," with 13 instances, suggest a potential theme of race to
explore.



#### Creating a Document Term Matrix

Each interview group was treated as a unique document, for a total of
ten documents with 2759 terms. Using the existing word counts, a matrix
was created that contained a column for each word in the corpus and a
value of n for how many times that word occurs in each document.

To create this document term matrix from the interview counts, the
`cast_dtm()` function was used and the result was assigned to the
variable `interviews_dtm`.

```{r cast-dtm}
interviews_dtm <- interviews_tidy %>%
  count(Topic, word) %>%
  cast_dtm(Topic, word, n)
```

```{r class-dtm, echo=FALSE}
class(interviews_dtm)

interviews_dtm
```
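
As a small illustration of the structure that `cast_dtm()` produces, the following sketch casts a hypothetical table of counts (the `toy_counts` values are invented) and inspects the result. Rows correspond to documents, columns to terms, and cells to counts.

```r
library(dplyr)
library(tidytext)

# Hypothetical per-document word counts
toy_counts <- tibble(
  Topic = c("Group A", "Group A", "Group B", "Group B"),
  word  = c("community", "language", "community", "travel"),
  n     = c(3, 1, 2, 4)
)

# One row per document (Topic), one column per distinct word
toy_dtm <- toy_counts %>%
  cast_dtm(Topic, word, n)

dim(toy_dtm)        # number of documents, number of terms
as.matrix(toy_dtm)  # dense view; only practical for small matrices
```

A word that never occurs in a document appears as a zero in that document's row, which is why document term matrices for real corpora are mostly sparse.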

### 2c. Preprocessing and (not) Stemming

Next the original data set for structural topic modeling was prepared
using the `textProcessor()` function to remove punctuation elements and
stop words to simplify results.

```{r textProcessor}
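# Build the stm inputs from the raw interview text.
# Stemming is switched off, in line with the decision described below;
# set stem = TRUE to reproduce the stemmed vocabulary used for comparison.
temp <- textProcessor(CIDRE_interviews$Interview,
                      metadata = CIDRE_interviews,
                      lowercase = TRUE,
                      removestopwords = TRUE,
                      removenumbers = TRUE,
                      removepunctuation = TRUE,
                      wordLengths = c(3, Inf),
                      stem = FALSE,
                      onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = NULL)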
```

Stemming was considered as a preprocessing step to reduce the size of
the vocabulary in natural language and thus simplify the model. Stemming
reduced the number of terms from 2920 to 2241. This reduction in corpus
size of 679 terms does not justify the risk of losing subtleties in
meaning between words with similar word stems, such as "support" and
"supported" that would be relevant to the findings of the study.
Therefore, this project does not use stemming.
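
The trade-off can be seen in a minimal sketch using `wordStem()` from the `SnowballC` package on a hypothetical list of words (the word list is invented for illustration). Stemming shrinks the vocabulary by collapsing inflected forms onto a shared stem, which is exactly where distinctions such as "support" versus "supported" are lost.

```r
library(SnowballC)

# Hypothetical vocabulary containing several inflected forms
words <- c("support", "supported", "supporting",
           "community", "communities", "teach", "teaching")

stems <- wordStem(words, language = "english")

# Side-by-side view of each word and the stem it collapses to
data.frame(word = words, stem = stems)

# Vocabulary size before and after stemming
length(unique(words))
length(unique(stems))
```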

```{r stm-inputs}
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
```

## 3. MODEL

The text was modeled to create a mathematical summary of the dataset.
The resulting summaries can help trends and patterns in the data
surface, particularly when compared to thematic analysis. Topic modeling
is an unsupervised learning approach that can provide insight into the
structure of the dataset.

Three steps were undertaken:

a.  **Fitting a Topic Model with LDA**. The `topicmodels` package and
    its `LDA()` function were used for unsupervised classification of
    the interview data to find natural groupings of words, or
    topics.
b.  **Fitting a Structural Topic Model**. The `stm` package and `stm()`
    function were used to fit a model that can draw on metadata about
    documents to improve the assignment of words to "topics" in the
    corpus.
c.  **Choosing K.** Finally, an appropriate number of topics was
    selected.

### 3a. Fitting a Topic Model with LDA

Latent Dirichlet Allocation (LDA) was used because it treats every
document as a mixture of topics and every topic as a mixture of words.
This means that a focus group interview could have an estimated topic
proportion of 80% for Topic 1 but also be partly about Topic 2.
Likewise, words can be shared between topics: words germane to the
study, such as "community" and "language," might appear with similar
probability in more than one topic.

LDA requires a value of *K*, the number of topics, to be specified in
advance. *K* = 20 was selected as a starting point, and the model was
then refit with *K* = 15 by way of comparison.

```{r LDA}

interviews_lda <- LDA(interviews_dtm, 
                  k = 20,
                  control = list(seed = 588)
                  )

interviews_lda
```
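
One way to compare candidate values of *K* side by side is to fit a model for each and record a fit statistic such as perplexity. The sketch below is only an illustration: it assumes the `AssociatedPress` example data that ships with `topicmodels` as a stand-in for `interviews_dtm`, and the candidate values 5 and 10 are arbitrary.

```r
library(topicmodels)

# Example data bundled with topicmodels, used here as a stand-in corpus
data("AssociatedPress", package = "topicmodels")
toy_dtm <- AssociatedPress[1:50, ]

# Arbitrary candidate numbers of topics for illustration
candidate_k <- c(5, 10)

# Fit one LDA model per candidate K with a fixed seed for reproducibility
fits <- lapply(candidate_k, function(k) {
  LDA(toy_dtm, k = k, control = list(seed = 588))
})

# In-sample perplexity for each fitted model (lower is a closer fit)
perp <- sapply(fits, perplexity)

data.frame(k = candidate_k, perplexity = perp)
```

Perplexity computed on the training documents tends to favor larger *K*, so it is best read alongside the interpretability of the resulting topics rather than as a decision rule on its own.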

```{r}
ap_topics <- tidy(interviews_lda, matrix = "beta")
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 6) %>% 
  ungroup() %>%
  arrange(topic, -beta)

ap_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
```
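
LDA does not use Term Frequency-Inverse Document Frequency (TF-IDF)
weighting to assign probabilities. It is a generative probabilistic
model fit to the raw term counts in the document term matrix, estimating
the per-topic word probabilities (beta) plotted above and the
per-document topic proportions (gamma).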
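
The probabilistic reading of beta can be checked directly: within each topic, the beta values across the whole vocabulary should sum to one. The following minimal sketch assumes the `AssociatedPress` example data from `topicmodels` and an arbitrary choice of three topics, rather than the interview model.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Example data bundled with topicmodels, used here as a stand-in corpus
data("AssociatedPress", package = "topicmodels")
toy_lda <- LDA(AssociatedPress[1:50, ], k = 3, control = list(seed = 588))

# Per-topic word probabilities: one row per topic-term pair
toy_beta <- tidy(toy_lda, matrix = "beta")

# Each topic is a probability distribution over the vocabulary,
# so its beta values should total 1 (up to rounding error)
beta_check <- toy_beta %>%
  group_by(topic) %>%
  summarise(total = sum(beta))

beta_check
```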

### 3b. Fitting a Structural Topic Model

Bail (2018) argues that one reason STM has risen in popularity and use is that it employs metadata about documents to improve the assignment of words to topics in a corpus, and that this metadata can be used to examine relationships between covariates and documents. This was useful for this project in confirming that topics related to themes revealed in the qualitative data.

#### The `stm` Package

Before fitting an STM model, it was necessary to extract the following elements:

```{r stm-docs}
docs <- temp$documents 
meta <- temp$meta 
vocab <- temp$vocab 
```

These elements were then used to fit the model using the same number of topics, *K* = 20, that was specified for the LDA topic model.

```{r stm}
interviews_stm <- stm(documents=docs, 
         data=meta,
         vocab=vocab, 
         K=20,
         max.em.its=25,
         verbose = FALSE)

interviews_stm
```
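
In the `stm` package, document metadata typically enters the model through a `prevalence` (or `content`) formula. The following is a minimal sketch of that pattern, assuming the `gadarian` example data bundled with `stm` and its `treatment` column as stand-ins for the focus group transcripts and their metadata. It is not the model fit above, and which covariate would suit this study is left open.

```r
library(stm)

# `gadarian` ships with stm; it stands in here for the interview data
processed <- textProcessor(gadarian$open.ended.response,
                           metadata = gadarian,
                           verbose = FALSE)

# Align documents, vocabulary, and metadata after preprocessing
prepped <- prepDocuments(processed$documents,
                         processed$vocab,
                         processed$meta,
                         verbose = FALSE)

# Topic prevalence is allowed to vary with the `treatment` covariate
toy_stm <- stm(documents = prepped$documents,
               vocab = prepped$vocab,
               K = 3,                      # arbitrary for illustration
               prevalence = ~ treatment,
               data = prepped$meta,
               max.em.its = 10,
               seed = 588,
               verbose = FALSE)

# Highest-probability and distinctive words for each topic
labelTopics(toy_stm, n = 5)
```

With a prevalence formula in place, `estimateEffect()` from the same package can then be used to examine how topic proportions relate to the covariate.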

The function allows for viewing the most probable words assigned to each topic.

```{r plot-stm}
# plot() dispatches to the STM method; n sets the number of words shown per topic
plot(interviews_stm, n = 5)
```

### 3c. Finding *K*

As noted earlier, selecting the number of topics for a model is a
non-trivial decision and can dramatically affect the results. Bail
(2018) notes that

> *The results of topic models should not be over-interpreted unless the
> researcher has strong theoretical apriori about the number of topics
> in a given corpus, or if the researcher has carefully validated the
> results of a topic model using both the quantitative and qualitative
> techniques described above.*


#### The FindTopicsNumber Function

The `ldatuning` package was used to assist with finding a value for *K*.

```{r find-topic, eval=FALSE}
k_metrics <- FindTopicsNumber(
  interviews_dtm,
  topics = seq(10, 75, by = 5),
  metrics = "Griffiths2004",
  method = "Gibbs",
  control = list(),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)

FindTopicsNumber_plot(k_metrics)
```

Note that the `FindTopicsNumber()` function supports three additional
metrics (`CaoJuan2009`, `Arun2010`, and `Deveaud2014`) that can be used
to estimate the most preferable number of topics for an LDA model. The
`Griffiths2004` metric from the package's default example was used here
because it produced the most interpretable results, as shown in the
figure below:

![](img/k_metrics.png){width="90%"}

As a general rule of thumb, and an admittedly simplistic heuristic, the
goal is to find an inflection point in the plot, which indicates a
reasonable value of *K*.
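
The inflection-point heuristic can also be approximated numerically. The sketch below assumes a hypothetical data frame shaped like the output of `FindTopicsNumber()`, with a `topics` column and a `Griffiths2004` column; the metric values are invented for illustration and are not results from this study. The 10% threshold is likewise an arbitrary assumption.

```r
# Hypothetical metric values; higher Griffiths2004 is better
k_metrics_toy <- data.frame(
  topics = seq(10, 40, by = 5),
  Griffiths2004 = c(-5200, -4900, -4740, -4720, -4710, -4715, -4730)
)

# Make sure rows run from the smallest to the largest K
k_metrics_toy <- k_metrics_toy[order(k_metrics_toy$topics), ]

# K with the highest metric value
best_k <- k_metrics_toy$topics[which.max(k_metrics_toy$Griffiths2004)]

# Improvement gained by each step up in K
gain <- diff(k_metrics_toy$Griffiths2004)

# First K after which the improvement drops below 10% of the largest gain
# (the 10% cut-off is an assumption, not a standard rule)
elbow_k <- k_metrics_toy$topics[which(gain < 0.1 * max(gain))[1]]

best_k
elbow_k
```

When the two values differ, the smaller one marks where adding topics stops paying off, which is often the more interpretable choice for a small corpus.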

#### The LDAvis Explorer

The LDAvis explorer was used to explore topic and word distributions.

```{r LDAvis}
toLDAvis(mod = interviews_stm, docs = docs)
```

## 4. EXPLORE & MODEL

Silge and Robinson (2017) note that fitting a topic model is the "easy
part." The hard part is making sense of the model results; the rest of
the analysis involves exploring and interpreting the model using the
approaches described in this section.

Bail (2018) cautions, however, that:

> *...post-hoc interpretation of topic models is rather dangerous... and
> can quickly come to resemble the process of "reading tea leaves," or
> finding meaning in patterns that are in fact quite arbitrary or even
> random.*

### 4a. Exploring Beta Values

The five most likely terms assigned to each topic were explored. These per-topic-per-word probabilities, or β ("beta") values, give the probability of a term (word) being generated from a topic.


```{r terms}
terms(interviews_lda, 5)
```

Topic 10 (experience, time, people, cool,
Germany) seems to be about participants' experience in Germany. This is consistent with the themes identified in the qualitative analysis of the same data.
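
An interpretation like this can be checked against the documents themselves by looking at gamma, the per-document topic proportions, and reading the documents that load most heavily on the topic. The following minimal sketch assumes the `AssociatedPress` example data from `topicmodels` and three topics, rather than the interview model; topic 1 is chosen arbitrarily.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Example data bundled with topicmodels, used here as a stand-in corpus
data("AssociatedPress", package = "topicmodels")
toy_lda <- LDA(AssociatedPress[1:50, ], k = 3, control = list(seed = 588))

# Per-document topic proportions: one row per document-topic pair
toy_gamma <- tidy(toy_lda, matrix = "gamma")

# Documents most strongly associated with topic 1
toy_gamma %>%
  filter(topic == 1) %>%
  slice_max(gamma, n = 3)

# Each document's proportions across topics should total 1
doc_totals <- toy_gamma %>%
  group_by(document) %>%
  summarise(total = sum(gamma))
```

Reading the top-ranked documents for a topic alongside its top terms gives a qualitative check on whether the label assigned to the topic is warranted.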

```{r tidy_lda}

tidy_lda <- tidy(interviews_lda)

tidy_lda
```


```{r top_terms}

top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 5 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")
```

## 5. COMMUNICATE

As Silge and Robinson (2017) note, fitting the topic model is easy
relative to the more challenging work of interpreting the model results.

Comparing results of thematic analysis conducted by researchers to topic
modeling conducted in R can provide insights into the limitations and
benefits of both approaches to analyzing qualitative data. As Bail
(2018) warns, "...post-hoc interpretation of topic models is rather
dangerous... and can quickly come to resemble the process of 'reading
tea leaves,' or finding meaning in patterns that are in fact quite
arbitrary or even random."

The same description could be applied to thematic analysis, a process
which is creative, time consuming, and can reflect the biases of the
researchers. By comparing the results of both approaches, researchers
and practitioners can gain a deeper understanding of the data and the
context in which it was collected. This can lead to transformative
insights into how qualitative research can be conducted in a more
rigorous and systematic manner, while still allowing for the creativity
and subjectivity that are inherent in this type of research. Moreover,
reflecting on the limitations and benefits of both thematic analysis and
topic modeling can help researchers and practitioners to identify areas
for improvement in their own research practices, as well as provide
guidance for future research studies. Ultimately, this can lead to more
robust and meaningful qualitative research that contributes to the
advancement of knowledge in a range of fields.


Initial findings from thematic analysis of focus group data indicated
that

1.) Self-selected digital projects allowed participants to develop
**professional knowledge** related to their specific teaching content,
curriculum, and classroom;

2.) International immersive experience with targeted professional
development provided an opportunity for **reflections on
pandemic/post-pandemic instruction**; and

3.) Teachers reflected that the experience had an impact on their
**teaching dispositions and plans for the future.**

Comparing topics to initial findings from thematic analysis suggests
that topic modeling reflects general, superficial interpretations of
the experience. Words such as "cool," "experience," "people," and "feel"
reflect a positive but superficial experience in Germany. Thematic
analysis, conducted by researchers with personal relationships with
participants, suggests a nuanced, more sophisticated reading of the data.
This analysis reflects outcomes from the experience which suggest deep
change in teacher values and practices, changes that do not seem to be
suggested by topic modeling. While the text mining approach allows for
analysis of a large corpus of data, qualitative research from a team
offers a more in-depth analysis. The two have value together in that
topic modeling can be used with a larger corpus of data and can offer a
confirmation of the general direction of themes created in thematic
analysis. The use of topic modeling in addition to thematic analysis can
provide a form of triangulation valuable to the credibility of a
qualitative research study.

A further step would be to examine which words tend to follow others
immediately, or which tend to co-occur within the same documents.


**References**

Braun, V., & Clarke, V. (2022). Conceptual and design thinking for
thematic analysis. Qualitative Psychology, 9(1), 3--26.
[https://doi.org/10.1037/qup0000196](https://psycnet.apa.org/doi/10.1037/qup0000196)

Bail, C. (2018). Strengths and weaknesses of text as data. Retrieved
from
<https://sicss.io/2019/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html>

Gillies, M., Murthy, D., Brenton, H., & Olaniyan, R. (2022). Theme and
topic: How qualitative research and topic modeling can be brought
together. arXiv preprint arXiv:2210.00707.
<https://doi.org/10.48550/arXiv.2210.00707>

Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach.
O'Reilly Media, Inc. Retrieved from:
<https://www.tidytextmining.com/topicmodeling.html>

Wickham, H. & Grolemund, G. (2017). R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data. O'Reilly Media, Inc. Retrieved
from <https://r4ds.had.co.nz/>

