The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.
To earn a badge for each lab, you are required to respond to a set of prompts in two parts:
In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will create a simple data product in R that demonstrates your ability to apply a data analysis technique introduced in this learning lab.
Use your institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or a topic of interest. More specifically, locate a text mining study that visualizes text data.
Provide an APA citation for your selected study.
How does topic modeling address research questions?
Draft a research question about a population you may be interested in studying, or one that would be of interest to educational researchers, that would require the collection of text data. Then answer the following questions:
What text data would need to be collected?
For what reason would text data need to be collected in order to address this question?
Explain the analytical level at which these text data would need to be collected and analyzed.
Use your case study file to try a small number of topics (e.g., 3) or a large number of topics (e.g., 30), and explain how changing the number of topics shapes the way you interpret the results.
I highly recommend creating a new R script in your lab-3 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.
# load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(ldatuning)
library(knitr)
library(LDAvis)
library(stringr) # used below to remove HTML text (stringr is already attached via the tidyverse)
# import discussion post data
ts_forum_data <- read_csv("data/ts_forum_data.csv",
                          col_types = cols(course_id = col_character(),
                                           forum_id = col_character(),
                                           discussion_id = col_character(),
                                           post_id = col_character()))
# remove HTML tags from the post_content column
ts_forum_data$post_content <- str_remove_all(ts_forum_data$post_content, "<.*?>")
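To check that this worked, I can count how many posts still contain anything tag-like. A quick heuristic sketch (the pattern below is just one rough way to spot leftover tags):
# quick sanity check: count posts that still contain something tag-like
sum(str_detect(ts_forum_data$post_content, "<[a-z!/]"), na.rm = TRUE)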
# tokenize (into words), remove stop words and custom stop words
forums_tidy <- ts_forum_data %>%
unnest_tokens(output = word, input = post_content) %>%
anti_join(stop_words, by = "word") %>%
filter(!word == "li") %>%
filter(!word == "href") %>%
filter(!word == "_blank") %>%
filter(!word == "https") %>%
filter(!word == "0px") %>%
filter(!word == "http")
# create document-term matrix from tidy forum data, then print a summary of the matrix
forums_dtm <- forums_tidy %>%
count(post_id, word) %>%
cast_dtm(post_id, word, n)
forums_dtm
## <<DocumentTermMatrix (documents: 5765, terms: 12972)>>
## Non-/sparse entries: 133717/74649863
## Sparsity : 100%
## Maximal term length: 320
## Weighting : term frequency (tf)
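Since ldatuning is loaded above but not otherwise used here, one option is to let it compare candidate numbers of topics on this DTM before fitting a model. A minimal sketch (the metric choices and the 2–20 range are my assumptions, and this can take a while to run):
# compare several candidate values of k using ldatuning metrics
k_metrics <- FindTopicsNumber(forums_dtm,
                              topics = seq(2, 20, by = 2),
                              metrics = c("Griffiths2004", "CaoJuan2009"),
                              method = "Gibbs",
                              control = list(seed = 588))
FindTopicsNumber_plot(k_metrics)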
# preprocess text (lowercase; remove punctuation, numbers, stop words, and custom stop words)
# NOT stemming here: stemming didn't seem to improve the results and made topics harder to interpret
temp <- textProcessor(ts_forum_data$post_content,
                      metadata = ts_forum_data,
                      lowercase = TRUE,
                      removestopwords = TRUE,
                      removenumbers = TRUE,
                      removepunctuation = TRUE,
                      wordLengths = c(3, Inf),
                      stem = FALSE,
                      onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = c("li", "href", "_blank", "https", "http", "0px"))
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Remove Custom Stopwords...
## Removing numbers...
## Creating Output...
# extract the inputs needed for the stm() function from temp
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
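As an aside, stm can also propose a number of topics on its own: setting K = 0 with spectral initialization applies the Lee and Mimno (2014) algorithm. A sketch (and my guess is this kind of approach is how the case study arrived at its estimate of 14, though that's an assumption):
# optional: let stm suggest a number of topics (K = 0 requires spectral initialization)
forums_stm_auto <- stm(documents = docs,
                       vocab = vocab,
                       data = meta,
                       K = 0,
                       init.type = "Spectral",
                       verbose = FALSE)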
# Latent Dirichlet allocation with 6 topics (probably too small a number for this dataset; the case study estimated 14)
forums_lda <- LDA(forums_dtm,
k = 6,
control = list(seed = 588)
)
# LDA: show the top 5 terms for each topic
terms(forums_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "students" "students" "statistics" "students" "statistics" "students"
## [2,] "data" "statistics" "unit" "data" "math" "data"
## [3,] "question" "questions" "resources" "school" "school" "task"
## [4,] "questions" "stats" "teaching" "class" "teach" "sample"
## [5,] "task" "ap" "learning" "real" "grade" "activity"
# tidy the LDA model into a new data frame that includes the beta values (per-topic word probabilities)
tidy_lda <- tidy(forums_lda)
# LDA: plot the top 5 terms in each topic
top_terms <- tidy_lda %>%
group_by(topic) %>%
slice_max(beta, n = 5, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
# beta and gamma values for the LDA approach: gamma captures the prevalence of each topic across the full data set, and beta captures which words contribute to each topic
td_beta <- tidy(forums_lda)
td_gamma <- tidy(forums_lda, matrix = "gamma")
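To make the gamma values concrete, I can pull out each post's single most prevalent topic. A minimal sketch using the objects just created:
# gamma: identify the most prevalent topic for each post
td_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()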
# prevalence table for the LDA approach
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>%
  unnest(cols = c(terms))
gamma_terms <- td_gamma %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms %>%
select(topic, gamma, terms) %>%
kable(digits = 3,
col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 2 | 0.245 | students, statistics, questions, stats, ap, class, teach |
| Topic 4 | 0.212 | students, data, school, class, real, time, agree |
| Topic 1 | 0.212 | students, data, question, questions, task, statistical, results |
| Topic 3 | 0.117 | statistics, unit, resources, teaching, learning, mooc, data |
| Topic 5 | 0.110 | statistics, math, school, teach, grade, teachers, teaching |
| Topic 6 | 0.105 | students, data, task, sample, activity, sampling, coke |
# Structural topic model with 6 topics (probably too small a number for this dataset; the case study estimated 14)
forums_stm <- stm(documents = docs,
                  data = meta,
                  vocab = vocab,
                  prevalence = ~ course_id + forum_id,
                  K = 6,
                  max.em.its = 25,
                  verbose = FALSE)
# STM: create a plot for each topic with the top 5 terms
plot.STM(forums_stm, n=5)
# STM: create a plot for each topic with the top 7 terms
plot.STM(forums_stm, n=7)
# STM: interactive visualization with the LDAvis explorer
toLDAvis(mod = forums_stm, docs = docs)
## Loading required namespace: servr
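Before interpreting topics from their top terms alone, it also helps to read exemplar posts; stm's findThoughts() returns the documents with the highest proportion of a given topic. A sketch for topic 1 (using meta rather than the raw data so the texts line up with the documents textProcessor kept):
# read the 3 posts most strongly associated with topic 1 to aid interpretation
findThoughts(forums_stm, texts = meta$post_content, topics = 1, n = 3)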
Explanation of results: I tested both the LDA method and the STM method with k = 6, and the two methods produced different results. With the LDA method it was difficult to tell the categories apart, since they contained significant overlap between terms (“students” or “statistics” is the top term in every category). The STM method produced slightly fewer repeated terms because of the prevalence covariates that method can draw on (here, the course and forum IDs), so I will use the results of the STM method for interpretation.
When using 6 topics instead of 20 as in our case study, some topics became more clearly separated (topics 2, 3, 5, and 6 from the STM method), although there was still some overlap between terms in topics 1 and 4. This is easiest to see in the LDAvis explorer we created. Topic 1 pertains to the teachers’ learning activities from the course and how they perceived them. Topic 2 pertains to prior knowledge about the students they teach. Topic 3 pertains to technology, including teachers’ own experiences and their attitudes toward using technology with students. Topic 4 has some overlap with Topic 2, but relates more closely to statistics learning activities for students. Topic 5 is closely related to Topic 2; it includes insight into students, but also reflections on school, curriculum, and policy and how those relate to students and teaching. Topic 6 relates to statistics skills, with comments about both the teachers’ own skills and students’ skills. There is still a lot of conceptual overlap between these categories, which indicates that my choice of k = 6 topics is likely too small. A better number of topics would be somewhere between k = 6 and the k = 20 from the case study.
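One way to follow up on that hunch is stm's searchK(), which compares held-out likelihood, semantic coherence, and other diagnostics across candidate values of K. A sketch (the candidate values are my assumption, and this is slow to run):
# compare model diagnostics for candidate numbers of topics between 6 and 20
k_search <- searchK(documents = docs,
                    vocab = vocab,
                    K = c(6, 10, 14, 20),
                    data = meta,
                    verbose = FALSE)
plot(k_search)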
Congratulations, you’ve completed your Intro to Text Mining Badge! Complete the following steps in the orientation to submit your work for review.