Topic Modeling Badge

The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts for two parts:

Part I: Reflect and Plan

Use the institutional library (e.g., NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or topic of interest. More specifically, locate a text mining study that visualizes text data.

Provide an APA citation for your selected study.
- Lucy, L., Demszky, D., Bromley, P., & Jurafsky, D. (2020). Content analysis of textbooks via natural language processing: Findings on gender, race, and ethnicity in Texas U.S. history textbooks. AERA Open, 6(3). https://doi.org/10.1177/2332858420940312
How does topic modeling address research questions?
- What are prominent topics and how are they related to groups of people?

Draft a research question for a population you may be interested in studying, or that would be of interest to educational researchers, and that would require the collection of text data and answer the following questions:

What text data would need to be collected?
- school board meeting minutes
For what reason would text data need to be collected in order to address this question?
- to identify trends and patterns in topics over time, particularly around critical events
Explain the analytical level at which these text data would need to be collected and analyzed.
- at the word level, with topics constructed from the words that occur together

Part II: Data Product

Use your case study file to try a small number of topics (e.g., 3) or a large number of topics (e.g., 30) and explain how changing the number of topics shapes the way you interpret results.

I highly recommend creating a new R script in your lab-3 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.

With too few topics (e.g., 3) the topics are too broad; with too many (e.g., 30) they are too narrow.
The number of topics therefore needs to be set with caution.
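
One way to act on the caution about setting the number of topics is to compare a few candidate values of k on fit metrics before interpreting any single model. The following is a minimal sketch, not the method used in this lab: it assumes the ldatuning package (loaded in the script for this lab but not otherwise used there), and, so that it runs on its own, it substitutes the AssociatedPress document-term matrix that ships with topicmodels for forums_dtm. The candidate values 3, 10, and 30 and the 50-document subset are illustrative choices.

```r
library(topicmodels)
library(ldatuning)

# Assumption: a small built-in DTM stands in for forums_dtm
# so that the sketch is self-contained.
data("AssociatedPress", package = "topicmodels")
example_dtm <- AssociatedPress[1:50, ]

# Fit one LDA model per candidate k and record two fit metrics.
k_metrics <- FindTopicsNumber(
  example_dtm,
  topics   = c(3, 10, 30),                     # illustrative candidates
  metrics  = c("CaoJuan2009", "Deveaud2014"),
  method   = "VEM",                            # same estimation method as LDA()'s default
  control  = list(seed = 588),
  mc.cores = 1L,
  verbose  = FALSE
)

# One row per candidate k.
# CaoJuan2009: lower is better; Deveaud2014: higher is better.
k_metrics
```

To apply this to the forum data, example_dtm would be replaced by forums_dtm; FindTopicsNumber_plot(k_metrics) draws the same values as a chart. The metrics only narrow the range of plausible k, so the resulting topics still need to be read and judged for interpretability.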

# YOUR FINAL CODE HERE
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)

## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com

library(ldatuning)
library(knitr)
library(LDAvis)
# Load the forum data, reading the ID columns as character
ts_forum_data <- read_csv("data/ts_forum_data.csv", 
                          col_types = cols(course_id = col_character(),
                                           forum_id = col_character(), 
                                           discussion_id = col_character(), 
                                           post_id = col_character()
                          )
)
# Tokenize posts and remove stop words; the document-term matrix (DTM) is cast further down
forums_tidy <- ts_forum_data %>%
   unnest_tokens(output = word, input = post_content) %>%
   anti_join(stop_words, by = "word")
forums_tidy %>%
   count(word, sort = TRUE)

## # A tibble: 13,620 × 2
##    word           n
##    <chr>      <int>
##  1 students    6841
##  2 data        4365
##  3 statistics  3103
##  4 school      1488
##  5 questions   1470
##  6 class       1426
##  7 font        1311
##  8 span        1267
##  9 time        1253
## 10 style       1150
## # ℹ 13,610 more rows

forums_dtm <- forums_tidy %>%
   count(post_id, word) %>%
   cast_dtm(post_id, word, n)
forum_quotes <- ts_forum_data %>%
   select(post_content) %>% 
   filter(grepl('time', post_content))
# Stemming: reduce each word to its stem (comparable to truncation in a Boolean search)
stemmed_forums <- ts_forum_data %>%
   unnest_tokens(output = word, input = post_content) %>%
   anti_join(stop_words, by = "word") %>%
   mutate(stem = wordStem(word))
# Cast the stemmed tokens to a DTM, reusing stemmed_forums instead of repeating the pipeline
stemmed_dtm <- stemmed_forums %>%
   count(post_id, stem) %>%
   cast_dtm(post_id, stem, n)
# Latent Dirichlet Allocation
lda3 <- LDA(forums_dtm, 
            k = 3, # number of topics
            control = list(seed = 588)
) 
lda30 <- LDA(forums_dtm, 
            k = 30, # number of topics
            control = list(seed = 588)
) 
lda3; lda30

## A LDA_VEM topic model with 3 topics.

## A LDA_VEM topic model with 30 topics.

# Structural Topic Modeling (STM): preprocess the posts with stm::textProcessor()
temp <- textProcessor(ts_forum_data$post_content, 
                      metadata = ts_forum_data,  # dataframe
                      lowercase=TRUE, 
                      removestopwords=TRUE, 
                      removenumbers=TRUE,  
                      removepunctuation=TRUE, 
                      wordLengths=c(3,Inf),
                      stem=TRUE,
                      onlycharacter= FALSE, 
                      striphtml=TRUE, 
                      customstopwords=NULL)

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...

meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
stm3 <- stm(documents=docs, 
                  data=meta,
                  vocab=vocab, 
                  prevalence =~ course_id + forum_id, # covariates
                  K=3,
                  max.em.its=25,
                  verbose = FALSE)
stm30 <- stm(documents=docs, 
             data=meta,
             vocab=vocab, 
             prevalence =~ course_id + forum_id, # covariates
             K=30,
             max.em.its=25,
             verbose = FALSE)
plot.STM(stm3, n = 5)

plot.STM(stm30, n = 5)

toLDAvis(mod = stm3, docs = docs)

## Loading required namespace: servr

toLDAvis(mod = stm30, docs = docs) 
terms(lda3, 5)

##      Topic 1  Topic 2      Topic 3     
## [1,] "font"   "statistics" "students"  
## [2,] "span"   "href"       "data"      
## [3,] "style"  "li"         "statistics"
## [4,] "text"   "strong"     "questions" 
## [5,] "normal" "https"      "school"

terms(lda30, 5)

##      Topic 1    Topic 2      Topic 3   Topic 4      Topic 5    Topic 6   
## [1,] "students" "resources"  "kids"    "statistics" "time"     "students"
## [2,] "level"    "statistics" "english" "math"       "students" "video"   
## [3,] "levels"   "teaching"   "scores"  "teach"      "class"    "thinking"
## [4,] "size"     "unit"       "cost"    "students"   "survey"   "videos"  
## [5,] "dice"     "mooc"       "pick"    "teaching"   "explore"  "enjoyed" 
##      Topic 7    Topic 8         Topic 9      Topic 10 Topic 11    
## [1,] "school"   "students"      "students"   "li"     "test"      
## [2,] "students" "understanding" "questions"  "strong" "hypothesis"
## [3,] "middle"   "agree"         "assessment" "href"   "difference"
## [4,] "sharing"  "time"          "test"       "https"  "sample"    
## [5,] "teachers" "gapminder"     "locus"      "target" "testing"   
##      Topic 12         Topic 13   Topic 14    Topic 15    Topic 16 Topic 17
## [1,] "school"         "students" "agree"     "questions" "font"   "span"  
## [2,] "students"       "sampling" "students"  "question"  "normal" "style" 
## [3,] "social"         "answers"  "classroom" "students"  "text"   "line"  
## [4,] "time"           "sample"   "makes"     "answer"    "0px"    "height"
## [5,] "transportation" "correct"  "sense"     "start"     "style"  "font"  
##      Topic 18     Topic 19 Topic 20      Topic 21   Topic 22   Topic 23   
## [1,] "activity"   "plots"  "students"    "access"   "data"     "uijy0"    
## [2,] "students"   "data"   "task"        "excel"    "students" "ms"       
## [3,] "experiment" "graph"  "data"        "tuva"     "real"     "gj7bbf88h"
## [4,] "engaged"    "box"    "tasks"       "coasters" "sets"     "gthy0"    
## [5,] "coke"       "class"  "statistical" "roller"   "collect"  "wb9h"     
##      Topic 24      Topic 25 Topic 26      Topic 27     Topic 28     Topic 29
## [1,] "statistics"  "td"     "technology"  "online"     "activities" "div"   
## [2,] "probability" "top"    "students"    "statistics" "project"    "http"  
## [3,] "statistical" "width"  "software"    "education"  "students"   "href"  
## [4,] "grade"       "nice"   "simulations" "href"       "grade"      "https" 
## [5,] "science"     "align"  "computer"    "https"      "lesson"     "target"
##      Topic 30  
## [1,] "stats"   
## [2,] "ap"      
## [3,] "class"   
## [4,] "students"
## [5,] "school"
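
Several of the LDA topics above are dominated by markup tokens such as "font", "span", "style", "href", and "li". That suggests post_content contains HTML, which the tidytext pipeline tokenizes as ordinary words (the STM preprocessing, by contrast, sets striphtml=TRUE). The following is a minimal base-R sketch of removing tags before tokenizing. The helper name strip_html and the two sample posts are hypothetical, and a regular expression is only a rough approximation of real HTML parsing.

```r
# Hypothetical helper: remove HTML tags and simple entities from a character vector.
strip_html <- function(x) {
  x <- gsub("<[^>]+>", " ", x)                  # drop tags such as <span ...> and </a>
  x <- gsub("&[a-zA-Z]+;|&#[0-9]+;", " ", x)    # drop entities such as &nbsp; and &#39;
  trimws(gsub("\\s+", " ", x))                  # collapse the whitespace left behind
}

# Made-up posts for illustration only.
posts <- c(
  '<span style="font-size: 12px;">Students enjoyed the <strong>data</strong> task.</span>',
  'See <a href="https://example.com" target="_blank">this resource</a>&nbsp;for ideas.'
)

cleaned <- strip_html(posts)
cleaned
# [1] "Students enjoyed the data task." "See this resource for ideas."
```

In the pipeline above, this could be applied with mutate(post_content = strip_html(post_content)) before unnest_tokens(), so that the 3-topic and 30-topic LDA models are compared on the words of the posts rather than on their formatting.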

LASER Institute TM Learning Lab 3

Lee

July 20, 2023
