With the increasing amount of qualitative data being generated in research studies, it is essential to have effective methods for analyzing and interpreting this information. The use of topic modeling in the analysis of textual data has gained popularity in recent years due to its ability to identify hidden themes and patterns within large datasets. This proposal reports a methodological advancement on how topic modeling can be used to analyze qualitative data.
This analysis explores the following research question:
What can text network analysis reveal about the themes in educators' reflections on an international professional development program held in the summer of 2022?
The data come from an ongoing international professional development program for in-service teachers. The program includes three Saturday classes in the spring, a two-week study abroad experience in the summer (in 2022, in Munich, Germany), and one follow-up session in the fall. During the program abroad, teachers build a portfolio that represents a chosen cultural theme, including a lesson plan illustrating how they plan to apply technical cultural representation and/or analysis in their own classrooms. The program aims to give teachers experience working with cultural frames and representational tools so that they can work with their own students to elicit and represent diverse cultural identities and perspectives. After the program abroad is completed, participants come together to reflect on and share their work, and they are interviewed in a focus group format.
The textual data are from transcripts of focus groups conducted with participants in the study following the international experience.
After setting up the project and installing and loading the required packages, the data were read into the RStudio environment using the read_csv() function.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ stringr 1.5.0
## ✔ tidyr 1.3.0 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidyr)
library(ggplot2)
library(igraph)
##
## Attaching package: 'igraph'
##
## The following objects are masked from 'package:purrr':
##
## compose, simplify
##
## The following object is masked from 'package:tidyr':
##
## crossing
##
## The following object is masked from 'package:tibble':
##
## as_data_frame
##
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
##
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
##
## The following object is masked from 'package:base':
##
## union
library(ggraph)
CIDRE_interviews <- read_csv("data/Independent Analysis Feb 21 2023.csv",
  col_types = cols(
    Interview_text = col_character(),
    Interviewee = col_character(),
    Topic = col_character()
  )
)
# Split each transcript cell at triple newlines so that each passage becomes its own row
CIDRE_interviews <- CIDRE_interviews %>%
  mutate(Interview_text = strsplit(as.character(Interview_text), "\n\n\n")) %>%
  unnest(Interview_text)
CIDRE_interviews
## # A tibble: 487 × 3
## Interview_text Inter…¹ Topic
## <chr> <chr> <chr>
## 1 "Since there was a revenue Okay Can I go now or do you mean no… Chris,… Mary…
## 2 "definitely show your pictures." Chris,… Mary…
## 3 "I’d love to see your picture." Chris,… Mary…
## 4 "This is my photo of my personal experience. It was taken by C… Chris,… Mary…
## 5 "for the tape No it's a very it's a very like peaceful setting… Chris,… Mary…
## 6 "Yeah, for the tape, it’s a swan, surrounded by its own feces … Chris,… Mary…
## 7 "making sure that she remembered in the photo that they were" Chris,… Mary…
## 8 "…so this is the German theatre Museum. This one I went to on … Chris,… Mary…
## 9 "Those are very grown up answers." Chris,… Mary…
## 10 "We have to be grown up at some point." Chris,… Mary…
## # … with 477 more rows, and abbreviated variable name ¹Interviewee
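The row-splitting step can be illustrated on a small example. The sketch below uses hypothetical toy data (the column values are invented and stand in for the interview file) and assumes passages are separated by exactly three newline characters, as in the call above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the interview CSV
toy <- tibble(
  Interviewee = c("A", "B"),
  Interview_text = c("first turn\n\n\nsecond turn", "only turn")
)

# strsplit() turns each cell into a character vector (a list-column);
# unnest() then gives every element of that vector its own row
toy_split <- toy %>%
  mutate(Interview_text = strsplit(Interview_text, "\n\n\n", fixed = TRUE)) %>%
  unnest(Interview_text)

toy_split
```

The other columns are repeated for each new row, which is why the same Interviewee and Topic values recur across the rows in the output above.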
Next, the data were tokenized into bigrams (pairs of consecutive words) using the tidytext unnest_tokens() function.
ct_bigrams <- CIDRE_interviews %>%
  unnest_tokens(bigram, Interview_text, token = "ngrams", n = 2)
ct_bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 19,177 × 2
## bigram n
## <chr> <int>
## 1 it was 230
## 2 you know 203
## 3 and i 194
## 4 i was 182
## 5 i think 160
## 6 kind of 154
## 7 and then 142
## 8 like i 136
## 9 of the 124
## 10 so i 119
## # … with 19,167 more rows
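To make the tokenization concrete, here is a minimal sketch on a single invented sentence (the text and the names toy_text and toy_bigrams are illustrative assumptions, not part of the study data). It shows that bigrams overlap: each word is paired with the word that follows it.

```r
library(dplyr)
library(tidytext)

# One hypothetical sentence; unnest_tokens() lowercases text by default
toy_text <- tibble(id = 1, text = "We left our comfort zone")

toy_bigrams <- toy_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams$bigram
```

A five-word sentence yields four overlapping bigrams, which is why common function-word pairs such as "it was" dominate the unfiltered counts above.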
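Stop words were removed to eliminate words that carry little meaning, so that the analysis can focus on more informative words. Custom stop words were included to improve the analysis. The check below looks up one candidate, "yeah," in the default tidytext stop_words lexicon; it returns no rows, meaning the default list does not contain it.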
stop_words[stop_words$word=="yeah",]
## # A tibble: 0 × 2
## # … with 2 variables: word <chr>, lexicon <chr>
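One common way to extend the default list is to bind extra rows onto stop_words. The sketch below is an assumption about how this could be done, not the procedure used in this analysis; the object name custom_stop_words and the "custom" lexicon label are hypothetical, and "yeah" is used only because the check above shows it is missing from the default list.

```r
library(dplyr)
library(tidytext)

# Append a hypothetical custom entry to the default stop word table
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("yeah"), lexicon = "custom")
)

# The custom entry is now present alongside the default lexicons
filter(custom_stop_words, word == "yeah")
```

Filtering against custom_stop_words$word instead of stop_words$word would then drop bigrams that contain the added words.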
The chart below shows the ten most frequent bigrams that remain after stop-word removal. Although this data story would be more effective with further customization of the stop words (the filler bigram “yeah yeah” still appears in the top ten), the most frequent bigram, “world war,” reflects World War II, a research focus for many participants in the program. Three of the top ten bigrams relate to culturally responsive pedagogy: “bucket list,” “comfort zone,” and “cultural understanding.”
bigrams_separated <- ct_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
# Keep only bigrams in which neither word is a stop word
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# count(sort = TRUE) already orders the result by descending n
bigram_counts <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ", remove = FALSE) %>%
  count(word1, word2, bigram, sort = TRUE)
bigram_counts
## # A tibble: 1,546 × 4
## word1 word2 bigram n
## <chr> <chr> <chr> <int>
## 1 world war world war 11
## 2 documentation center documentation center 10
## 3 north carolina north carolina 10
## 4 yeah yeah yeah yeah 10
## 5 lesson plan lesson plan 8
## 6 youth library youth library 8
## 7 bucket list bucket list 7
## 8 comfort zone comfort zone 7
## 9 cultural understanding cultural understanding 7
## 10 international youth international youth 7
## # … with 1,536 more rows
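# Order bars by count and flip the axes so the bigram labels are readable
p <- ggplot(head(bigram_counts, 10)) +
  geom_col(aes(x = reorder(bigram, n), y = n), fill = "cornflowerblue", width = 0.6) +
  labs(x = "bigram") +
  coord_flip()
p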
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigrams_united
## # A tibble: 1,825 × 3
## Interviewee Topic bigram
## <chr> <chr> <chr>
## 1 Chris, Katilin M, Karen Rose, Erica, Callie Mary Visual Sharing i’d love
## 2 Chris, Katilin M, Karen Rose, Erica, Callie Mary Visual Sharing personal exp…
## 3 Chris, Katilin M, Karen Rose, Erica, Callie Mary Visual Sharing amazing expe…
## 4 Chris, Katilin M, Karen Rose, Erica, Callie Mary Visual Sharing lifelong fri…
## 5 Chris, Katilin M, Karen Rose, Erica, Callie Mary Visual Sharing friends erica
## 6 Chris, Katilin M, Karen Rose, Erica, Callie Mary Visual Sharing callie pushed
## 7 Chris, Katilin M, Karen Rose, Erica, Callie Mary Visual Sharing comfort zone
## 8 Chris, Katilin M, Karen Rose, Erica, Callie Mary Visual Sharing safe space
## 9 Chris, Katilin M, Karen Rose, Erica, Callie Mary Visual Sharing ate drank
## 10 Chris, Katilin M, Karen Rose, Erica, Callie Mary Visual Sharing laughed lots
## # … with 1,815 more rows
bigram_graph_filtered <- bigram_counts %>%
  filter(n > 1) %>%
  graph_from_data_frame()
## Warning in graph_from_data_frame(.): In `d' `NA' elements were replaced with
## string "NA"
bigram_graph_filtered
## IGRAPH 4dbb652 DN-- 208 140 --
## + attr: name (v/c), bigram (e/c), n (e/n)
## + edges from 4dbb652 (vertex names):
## [1] world ->war documentation->center
## [3] north ->carolina yeah ->yeah
## [5] lesson ->plan youth ->library
## [7] bucket ->list comfort ->zone
## [9] cultural ->understanding international->youth
## [11] war ->ii global ->community
## [13] ice ->cream makes ->sense
## [15] personal ->experience public ->transportation
## + ... omitted several edges
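The warning above indicates that some rows reached graph_from_data_frame() with missing words, which igraph converts into a vertex literally named "NA". A minimal sketch of one way to avoid this, assuming the missing values come from passages too short to form a bigram, is to drop those rows before building the graph. The toy counts below are hypothetical.

```r
library(dplyr)
library(igraph)

# Hypothetical bigram counts, including one row with missing words
toy_counts <- tibble(
  word1 = c("world", "comfort", NA),
  word2 = c("war", "zone", NA),
  n = c(11, 7, 3)
)

# Remove incomplete bigrams first, then keep those seen more than once
toy_graph <- toy_counts %>%
  filter(!is.na(word1), !is.na(word2), n > 1) %>%
  graph_from_data_frame()

V(toy_graph)$name
```

Applied to bigram_counts, the same filter would keep a spurious "NA" node out of the network plot.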
Word network after filtering:
set.seed(100)
a <- grid::arrow(type = "closed", length = unit(.2, "inches"))
ggraph(bigram_graph_filtered, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "red", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
This analysis had limitations. First, the bigrams were of limited use for addressing the research question. They mostly reflected compound nouns such as “North Carolina” and collocations (two words that are often used together), and they were not particularly generative for making meaning of the research question. Second, the stop word list seemed overly extensive. If this analysis were repeated, it would be worthwhile to scrutinize the dictionary of stop words used.
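One way to begin scrutinizing the stop word dictionary is to look at the lexicons it combines. The tidytext stop_words table merges three sources (SMART, snowball, and onix), and restricting the filter to a single lexicon gives a shorter list. The sketch below is only a possible starting point; choosing snowball is an assumption made for illustration, not a recommendation drawn from this study.

```r
library(dplyr)
library(tidytext)

# How many stop words does each source lexicon contribute?
stop_words %>% count(lexicon)

# Keep only one lexicon to obtain a less aggressive stop word list
snowball_words <- stop_words %>%
  filter(lexicon == "snowball")

nrow(snowball_words) < nrow(stop_words)
```

Re-running the bigram filter against snowball_words$word would show how much the choice of dictionary changes which bigrams survive.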
However, it was useful to examine which words in the transcripts tend to immediately follow others or to co-occur within the transcripts. Bigrams such as “global perspectives,” “beer garden,” and “dumpster fire” are relevant to the research question and are meaningful as bigrams, but they would not hold the same meaning as separate tokens. These bigrams reflect themes uncovered in the qualitative thematic analysis that researchers conducted on these data. In this way, the analysis shows how text mining can serve as an analytic approach that supports qualitative research, and it has been a valuable stepping stone toward a deeper understanding of the research question.