The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.
To earn a badge for each lab, you are required to respond to a set of prompts in two parts:
In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will create a simple data product in R that demonstrates your ability to apply a data analysis technique introduced in this learning lab.
Use your institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or a topic of interest. More specifically, locate a text mining study that visualizes text data.
Liu, M., Jiang, X., Zhang, B., Song, T., Yu, G., Liu, G., … & Zhou, Z. (2023). How do topics and emotions develop in elementary school children? A text mining perspective based on free-writing text over 6 years. Frontiers in Psychology, 14, 1109126.
Four main findings are summarized based on topic modeling. The results show the following: (1) children prefer to focus on the topics of school and family in elementary school; (2) as grade level increases, the proportion of family topics continues to decline, while that of social culture topics keeps rising; (3) when describing school, family, social culture, and interest, children mostly express negative emotions, while when describing peers and ability they mostly express positive emotions; (4) as grade level increases, emotional expression on social culture topics becomes more negative, while that on ability and interest topics becomes more positive, and there are more differences in emotional expression between topics in the junior and senior elementary grades.
Draft a research question for a population you may be interested in studying, or one that would be of interest to educational researchers, that would require the collection of text data, and answer the following questions:
I plan to use abstracts from peer-reviewed journal articles as my text data in a systematic literature review study.
My goal is a comprehensive examination of what research has been done using the target scale (i.e., the Pediatric Symptom Checklist).
I will collect the abstracts from online research databases and use topic modeling to identify the underlying themes.
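As a rough illustration of this plan, the sketch below prepares such abstracts for topic modeling with the same tidytext workflow used later in this lab. The file abstracts.csv and its columns article_id and abstract are hypothetical placeholders, not real data.
library(tidyverse)
library(tidytext)
library(topicmodels)
abstracts <- read_csv("data/abstracts.csv")   # hypothetical file: one row per article
abstracts_dtm <- abstracts %>%
  unnest_tokens(output = word, input = abstract) %>%   # tokenize abstract text
  anti_join(stop_words, by = "word") %>%               # drop common stop words
  count(article_id, word) %>%
  cast_dtm(article_id, word, n)                        # cast to a document-term matrix
abstracts_lda <- LDA(abstracts_dtm, k = 5, control = list(seed = 588))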
Use your case study file to try a small number of topics (e.g., 3) or a large number of topics (e.g., 30) and explain how changing the number of topics shapes the way you interpret the results.
I highly recommend creating a new R script in your lab-3 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.
I think it is relatively easier to interpret the results with a small number of topics. There are no overlaps among topics based on the output from LDAvis. Meanwhile, it is obvious that some key themes are not covered. I also noticed that the model returns results faster with a small number of topics. There is a dominant theme (Topic 1). I think Topic 1 is similar to the leading topic about teaching statistics in the 20-topic solution. This topic is also evident in the 30-topic solution. There is a lot of redundancy when we fit the model with 30 topics, as indicated by the LDAvis output. In addition, based on the gamma output, it is hard to identify a dominant theme from the expected topic proportions.
# YOUR FINAL CODE HERE
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(ldatuning)
library(knitr)
library(LDAvis)
##READ IN DATA##
ts_forum_data <- read_csv("data/ts_forum_data.csv",
col_types = cols(course_id = col_character(),
forum_id = col_character(),
discussion_id = col_character(),
post_id = col_character()
)
)
##Tokenize forums##
forums_tidy <- ts_forum_data %>%
unnest_tokens(output = word, input = post_content) %>%
anti_join(stop_words, by = "word")
##sort words by count##
forums_tidy %>%
count(word, sort = TRUE)
## # A tibble: 13,620 × 2
## word n
## <chr> <int>
## 1 students 6841
## 2 data 4365
## 3 statistics 3103
## 4 school 1488
## 5 questions 1470
## 6 class 1426
## 7 font 1311
## 8 span 1267
## 9 time 1253
## 10 style 1150
## # ℹ 13,610 more rows
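##NOTE: several of the top "words" above (font, span, style) are leftover##
##HTML/CSS tokens rather than real vocabulary. A possible fix (my own##
##suggestion, not part of the original lab code) is a custom stop word list:##
html_stops <- tibble(word = c("font", "span", "style", "href", "li",
                              "strong", "div", "http", "https", "0px"))
forums_tidy_clean <- forums_tidy %>%
  anti_join(html_stops, by = "word")   # remove the formatting tokens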
##filter responses to get a general idea of the themes##
forum_quotes <- ts_forum_data %>%
select(post_content) %>%
filter(grepl('agree', post_content))
sample_n(forum_quotes, 10)
## # A tibble: 10 × 1
## post_content
## <chr>
## 1 My eight year old daughter was sitting beside me and asked what I was workin…
## 2 I agree Rachel The graphics and data exercises are much more accessible fo…
## 3 I like what you said about video two how the students completed all three l…
## 4 I agree I can't wait to use the data sets with my students. I loved the vid…
## 5 I agree that it is difficult to get students to pose questions. I teach onli…
## 6 I liked the Harry Potter one the best. I didn't agree with all the choices i…
## 7 I agree that some questions were not good questions and a bit confusing. Hav…
## 8 I found that 8 question quiz very difficult and quite humbling. I have becom…
## 9 I agree with you it was nice to see the different levels and methods instead…
## 10 I agree. It is important also that students are taught how to think critica…
##cast word counts to a document-term matrix, assigned to forums_dtm##
forums_dtm <- forums_tidy %>%
count(post_id, word) %>%
cast_dtm(post_id, word, n)
class(forums_dtm)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
##process text for structural topic modeling##
temp <- textProcessor(ts_forum_data$post_content,
                      metadata = ts_forum_data,
                      lowercase = TRUE,
                      removestopwords = TRUE,
                      removenumbers = TRUE,
                      removepunctuation = TRUE,
                      wordLengths = c(3, Inf),
                      stem = TRUE,
                      onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
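##a common next step in the stm workflow (not used in my code above) is##
##prepDocuments(), which drops infrequent terms and keeps the documents,##
##vocabulary, and metadata aligned; lower.thresh = 5 is an illustrative choice##
prepped <- prepDocuments(docs, vocab, meta, lower.thresh = 5)
##prepped$documents, prepped$vocab, and prepped$meta could then be passed to##
##stm() below in place of docs, vocab, and meta (results would differ slightly)##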
##stem words##
stemmed_forums <- ts_forum_data %>%
unnest_tokens(output = word, input = post_content) %>%
anti_join(stop_words, by = "word") %>%
mutate(stem = wordStem(word))
stemmed_forums
## # A tibble: 192,160 × 15
## course_id course_name forum_id forum_name discussion_id discussion_name
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 2 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 3 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 4 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 5 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 6 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 7 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 8 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 9 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 10 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## # ℹ 192,150 more rows
## # ℹ 9 more variables: discussion_creator <dbl>, discussion_poster <dbl>,
## # discussion_reference <chr>, parent_id <dbl>, post_date <chr>,
## # post_id <chr>, post_title <chr>, word <chr>, stem <chr>
stemmed_dtm <- ts_forum_data %>%
unnest_tokens(output = word, input = post_content) %>%
anti_join(stop_words, by = "word") %>%
mutate(stem = wordStem(word)) %>%
count(post_id, stem) %>%
cast_dtm(post_id, stem, n)
stemmed_dtm
## <<DocumentTermMatrix (documents: 5766, terms: 10001)>>
## Non-/sparse entries: 136185/57529581
## Sparsity : 100%
## Maximal term length: NA
## Weighting : term frequency (tf)
stemmed_count <- ts_forum_data %>%
unnest_tokens(output = word, input = post_content) %>%
anti_join(stop_words, by = "word") %>%
mutate(stem = wordStem(word)) %>%
count(stem, sort = TRUE)
stemmed_count
## # A tibble: 10,001 × 2
## stem n
## <chr> <int>
## 1 student 7354
## 2 data 4365
## 3 statist 4161
## 4 question 2470
## 5 teach 1858
## 6 class 1738
## 7 school 1606
## 8 time 1457
## 9 learn 1372
## 10 font 1311
## # ℹ 9,991 more rows
##3-topic model##
##set k to 3 in the LDA and stm calls##
forums_lda_3 <- LDA(forums_dtm,
k = 3,
control = list(seed = 588)
)
forums_stm_3 <- stm(documents = docs,
                    data = meta,
                    vocab = vocab,
                    prevalence = ~ course_id + forum_id,
                    K = 3,
                    max.em.its = 25,
                    verbose = FALSE)
plot.STM(forums_stm_3, n = 5)
## Visualize topics##
toLDAvis(mod = forums_stm_3, docs = docs)
## Loading required namespace: servr
##tidy beta (word-topic probabilities) and gamma (document-topic proportions)##
td_beta_3 <- tidy(forums_lda_3)
td_gamma_3 <- tidy(forums_lda_3, matrix = "gamma")
td_beta_3
## # A tibble: 40,860 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 2015 1.98e- 4
## 2 2 2015 5.59e- 4
## 3 3 2015 5.69e- 5
## 4 1 21 1.44e-40
## 5 2 21 1.32e- 4
## 6 3 21 1.29e-17
## 7 1 beginning 5.02e- 5
## 8 2 beginning 1.34e- 4
## 9 3 beginning 8.14e- 4
## 10 1 content 5.24e- 4
## # ℹ 40,850 more rows
td_gamma_3
## # A tibble: 17,298 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 11295 1 0.00335
## 2 12711 1 0.000413
## 3 12725 1 0.0717
## 4 12733 1 0.00393
## 5 12743 1 0.0146
## 6 12744 1 0.00688
## 7 12756 1 0.0717
## 8 12757 1 0.00500
## 9 12775 1 0.110
## 10 12816 1 0.00500
## # ℹ 17,288 more rows
top_terms_3 <- td_beta_3 %>%
arrange(beta) %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(-beta) %>%
select(topic, term) %>%
summarise(terms = list(term)) %>%
mutate(terms = map(terms, paste, collapse = ", ")) %>%
unnest(cols = c(terms))
gamma_terms_3 <- td_gamma_3 %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms_3, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms_3 %>%
select(topic, gamma, terms) %>%
kable(digits = 3,
col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 3 | 0.780 | students, data, statistics, questions, school, class, time |
| Topic 2 | 0.176 | statistics, href, li, strong, https, resources, target |
| Topic 1 | 0.044 | font, span, style, text, normal, 0px, height |
plot(forums_stm_3, n = 7)
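##since this lab is about visualizing text data, the expected topic##
##proportions in the table above can also be plotted directly; a minimal##
##sketch with ggplot2, using the gamma_terms_3 table built above##
gamma_terms_3 %>%
  ggplot(aes(x = topic, y = gamma, label = terms)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0, nudge_y = 0.01, size = 3) +   # print top terms beside each bar
  coord_flip() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(x = NULL, y = "Expected topic proportion")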
##30-topic model##
##set k to 30 in the LDA and stm calls##
forums_lda_30 <- LDA(forums_dtm,
k = 30,
control = list(seed = 588)
)
forums_stm_30 <- stm(documents = docs,
                     data = meta,
                     vocab = vocab,
                     prevalence = ~ course_id + forum_id,
                     K = 30,
                     max.em.its = 25,
                     verbose = FALSE)
plot.STM(forums_stm_30, n = 5)
## Visualize topics##
toLDAvis(mod = forums_stm_30, docs = docs)
##tidy beta (word-topic probabilities) and gamma (document-topic proportions)##
td_beta_30 <- tidy(forums_lda_30)
td_gamma_30 <- tidy(forums_lda_30, matrix = "gamma")
td_beta_30
## # A tibble: 408,600 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 2015 3.94e-107
## 2 2 2015 1.00e- 3
## 3 3 2015 7.14e- 11
## 4 4 2015 4.13e-111
## 5 5 2015 1.27e- 28
## 6 6 2015 1.14e- 79
## 7 7 2015 1.66e- 35
## 8 8 2015 5.79e- 27
## 9 9 2015 1.47e- 4
## 10 10 2015 9.24e- 5
## # ℹ 408,590 more rows
td_gamma_30
## # A tibble: 172,980 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 11295 1 0.00201
## 2 12711 1 0.000259
## 3 12725 1 0.0211
## 4 12733 1 0.00233
## 5 12743 1 0.00746
## 6 12744 1 0.00392
## 7 12756 1 0.0211
## 8 12757 1 0.00292
## 9 12775 1 0.00292
## 10 12816 1 0.00292
## # ℹ 172,970 more rows
top_terms_30 <- td_beta_30 %>%
arrange(beta) %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(-beta) %>%
select(topic, term) %>%
summarise(terms = list(term)) %>%
mutate(terms = map(terms, paste, collapse = ", ")) %>%
unnest(cols = c(terms))
gamma_terms_30 <- td_gamma_30 %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms_30, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms_30 %>%
select(topic, gamma, terms) %>%
kable(digits = 3,
col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 4 | 0.067 | statistics, math, teach, students, teaching, school, level |
| Topic 22 | 0.059 | data, students, real, sets, collect, analysis, collection |
| Topic 2 | 0.058 | resources, statistics, teaching, unit, mooc, learning, teachers |
| Topic 20 | 0.049 | students, task, data, tasks, statistical, cycle, question |
| Topic 15 | 0.047 | questions, question, students, answer, start, thinking, posing |
| Topic 8 | 0.046 | students, understanding, agree, time, gapminder, standard, calculations |
| Topic 9 | 0.045 | students, questions, assessment, test, locus, understand, understanding |
| Topic 30 | 0.044 | stats, ap, class, students, school, math, stat |
| Topic 6 | 0.041 | students, video, thinking, videos, enjoyed, skills, critical |
| Topic 7 | 0.041 | school, students, middle, sharing, teachers, statistical, tools |
| Topic 28 | 0.040 | activities, project, students, grade, lesson, class, plan |
| Topic 14 | 0.038 | agree, students, classroom, makes, sense, hands, real |
| Topic 26 | 0.037 | technology, students, software, simulations, computer, calculator, tools |
| Topic 18 | 0.036 | activity, students, experiment, engaged, coke, pepsi, class |
| Topic 5 | 0.036 | time, students, class, survey, explore, topic, student |
| Topic 1 | 0.036 | students, level, levels, size, dice, sample, trials |
| Topic 24 | 0.032 | statistics, probability, statistical, grade, science, teaching, teach |
| Topic 13 | 0.030 | students, sampling, answers, sample, correct, population, results |
| Topic 11 | 0.028 | test, hypothesis, difference, sample, testing, chance, results |
| Topic 12 | 0.024 | school, students, social, time, transportation, media, studies |
| Topic 10 | 0.023 | li, strong, href, https, target, _blank, statistics |
| Topic 19 | 0.022 | plots, data, graph, box, class, graphs, median |
| Topic 21 | 0.021 | access, excel, tuva, coasters, roller, steel, statcrunch |
| Topic 16 | 0.018 | font, normal, text, 0px, style, color, rgb |
| Topic 29 | 0.018 | div, http, href, https, target, amp, _blank |
| Topic 27 | 0.015 | online, statistics, education, href, https, mathematics, http |
| Topic 3 | 0.014 | kids, english, scores, cost, pick, agreed, stick |
| Topic 17 | 0.014 | span, style, line, height, font, quot, size |
| Topic 25 | 0.013 | td, top, width, nice, align, easy, tr |
| Topic 23 | 0.007 | uijy0, ms, gj7bbf88h, gthy0, wb9h, 9, uijndkm77bbf8apif99h |
plot(forums_stm_30, n = 7)
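The ldatuning package loaded at the top is never actually used above. As a closing sketch (the metric choices and the range of k values are illustrative assumptions, not part of the original lab), FindTopicsNumber() can help adjudicate between the 3-topic and 30-topic solutions by scoring a range of k values:
k_metrics <- FindTopicsNumber(
  forums_dtm,
  topics = seq(from = 2, to = 30, by = 2),
  metrics = c("CaoJuan2009", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 588),
  verbose = TRUE
)
##look for the k where CaoJuan2009 is minimized and Deveaud2014 is maximized##
FindTopicsNumber_plot(k_metrics)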
Congratulations, you’ve completed your Intro to Text Mining badge! Complete the following steps in the orientation to submit your work for review.