The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts in two parts:

Part I: Reflect and Plan

Use your institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or topic of interest. More specifically, locate a text mining study that visualizes text data.

  1. Provide an APA citation for your selected study.

Liu, M., Jiang, X., Zhang, B., Song, T., Yu, G., Liu, G., … Zhou, Z. (2023). How do topics and emotions develop in elementary school children? A text mining perspective based on free-writing text over 6 years. Frontiers in Psychology, 14, 1109126.

  2. How does topic modeling address the research questions?

Four themes were identified through topic modeling. The results show the following: (1) children prefer to focus on the topics of school and family in elementary school; (2) as grade level increases, the proportion of family topics continues to decline, while that of social culture topics keeps rising; (3) when describing school, family, social culture, and interest, children mostly express negative emotions, and when describing peers and ability they mostly express positive emotions; (4) as grade level increases, emotional expression on social culture topics becomes negative, while that on ability and interest topics becomes positive, and there are more differences in emotional expression between topics in the junior and senior elementary grades.

Draft a research question for a population you may be interested in studying, or that would be of interest to educational researchers, and that would require the collection of text data. Then answer the following questions:

  1. What text data would need to be collected?

I plan to use abstracts from peer-reviewed journal articles as the text data in a systematic literature review study.

  2. For what reason would text data need to be collected in order to address this question?

To gain a comprehensive picture of what research has been done using the target scale (i.e., the Pediatric Symptom Checklist).

  3. Explain the analytical level at which these text data would need to be collected and analyzed.

The abstracts would be collected from online research databases and analyzed at the document level (one abstract per document), using topic modeling to identify the underlying themes.
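
As a minimal sketch of how this plan could be carried out, the same pipeline used later in this lab could be reused. The file name abstracts.csv and the columns doi and abstract are hypothetical placeholders:

library(tidyverse)
library(tidytext)
library(topicmodels)

## hypothetical input: one row per article, with columns doi and abstract
abstracts <- read_csv("data/abstracts.csv")

## tokenize abstracts, drop stop words, and cast to a document-term matrix
abstracts_dtm <- abstracts %>%
  unnest_tokens(output = word, input = abstract) %>%
  anti_join(stop_words, by = "word") %>%
  count(doi, word) %>%
  cast_dtm(doi, word, n)

## fit an LDA topic model; k = 5 is an arbitrary starting value
abstracts_lda <- LDA(abstracts_dtm, k = 5, control = list(seed = 588))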

Part II: Data Product

Use your case study file to try a small number of topics (e.g., 3) or a large number of topics (e.g., 30), and explain how changing the number of topics shapes the way you interpret the results.

I highly recommend creating a new R script in your lab-3 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.

I think it is relatively easier to interpret the results with a small number of topics. With three topics there are no overlaps among topics in the LDAvis output, and it is also faster to get results. At the same time, it is obvious that some key themes are not covered. There is one dominant theme (Topic 1), which I think is similar to the leading topic about teaching statistics in the 20-topic solution; this topic is also evident in the 30-topic solution. With 30 topics there is a lot of redundancy, as indicated by the LDAvis output, and based on the gamma output it is hard to identify a dominant theme from the expected topic proportions.
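
Because the choice of K drives all of these interpretations, one way to inform it is the ldatuning package, which is loaded in the code below but not otherwise used there. A minimal sketch, assuming the forums_dtm object built in the final code:

## score candidate values of k on several fit metrics
k_metrics <- FindTopicsNumber(forums_dtm,
                              topics = seq(4, 30, by = 2),
                              metrics = c("Griffiths2004", "CaoJuan2009",
                                          "Arun2010", "Deveaud2014"),
                              method = "Gibbs",
                              control = list(seed = 588))

## plot the metrics and look for an elbow between the two extremes
FindTopicsNumber_plot(k_metrics)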

# YOUR FINAL CODE HERE

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com
library(ldatuning)
library(knitr)
library(LDAvis)
##READ IN DATA##
ts_forum_data <- read_csv("data/ts_forum_data.csv", 
                          col_types = cols(course_id = col_character(),
                                           forum_id = col_character(), 
                                           discussion_id = col_character(), 
                                           post_id = col_character()
                          )
)

##Tokenize forums##
forums_tidy <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word")
## sort words by count ##
forums_tidy %>%
  count(word, sort = TRUE)
## # A tibble: 13,620 × 2
##    word           n
##    <chr>      <int>
##  1 students    6841
##  2 data        4365
##  3 statistics  3103
##  4 school      1488
##  5 questions   1470
##  6 class       1426
##  7 font        1311
##  8 span        1267
##  9 time        1253
## 10 style       1150
## # ℹ 13,610 more rows
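## aside (assumption, not run here): "font", "span", and "style" look like
## HTML markup leaking into the posts rather than real forum vocabulary;
## a custom stopword list could remove them before modeling, e.g.:
## html_stops <- tibble(word = c("font", "span", "style"))
## forums_tidy <- forums_tidy %>% anti_join(html_stops, by = "word")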
##filter responses to get a general idea of the themes##
forum_quotes <- ts_forum_data %>%
  select(post_content) %>% 
  filter(grepl('agree', post_content))
sample_n(forum_quotes, 10)
## # A tibble: 10 × 1
##    post_content                                                                 
##    <chr>                                                                        
##  1 My eight year old daughter was sitting beside me and asked what I was workin…
##  2 I agree Rachel   The graphics and data exercises are much more accessible fo…
##  3 I like what you said about video two  how the students completed all three l…
##  4 I agree  I can't wait to use the data sets with my students. I loved the vid…
##  5 I agree that it is difficult to get students to pose questions. I teach onli…
##  6 I liked the Harry Potter one the best. I didn't agree with all the choices i…
##  7 I agree that some questions were not good questions and a bit confusing. Hav…
##  8 I found that 8 question quiz very difficult and quite humbling. I have becom…
##  9 I agree with you it was nice to see the different levels and methods instead…
## 10 I agree.  It is important also that students are taught how to think critica…
## cast word counts into a document-term matrix, assigned to forums_dtm ##
forums_dtm <- forums_tidy %>%
  count(post_id, word) %>%
  cast_dtm(post_id, word, n)
class(forums_dtm)
## [1] "DocumentTermMatrix"    "simple_triplet_matrix"
temp <- textProcessor(ts_forum_data$post_content, 
                      metadata = ts_forum_data,  
                      lowercase = TRUE, 
                      removestopwords = TRUE, 
                      removenumbers = TRUE,  
                      removepunctuation = TRUE, 
                      wordLengths = c(3, Inf),
                      stem = TRUE,
                      onlycharacter = FALSE, 
                      striphtml = TRUE, 
                      customstopwords = NULL)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
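## aside (not run): the standard stm workflow also calls prepDocuments() to
## drop very infrequent terms before fitting, e.g.:
## out <- prepDocuments(temp$documents, temp$vocab, temp$meta, lower.thresh = 5)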
##stem words##
stemmed_forums <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))

stemmed_forums
## # A tibble: 192,160 × 15
##    course_id course_name       forum_id forum_name discussion_id discussion_name
##    <chr>     <chr>             <chr>    <chr>      <chr>         <chr>          
##  1 9         Teaching Statist… 126      Investiga… 6822          Not much compa…
##  2 9         Teaching Statist… 126      Investiga… 6822          Not much compa…
##  3 9         Teaching Statist… 126      Investiga… 6822          Not much compa…
##  4 9         Teaching Statist… 126      Investiga… 6822          Not much compa…
##  5 9         Teaching Statist… 126      Investiga… 6822          Not much compa…
##  6 9         Teaching Statist… 126      Investiga… 6822          Not much compa…
##  7 9         Teaching Statist… 126      Investiga… 6822          Not much compa…
##  8 9         Teaching Statist… 126      Investiga… 6822          Not much compa…
##  9 9         Teaching Statist… 126      Investiga… 6822          Not much compa…
## 10 9         Teaching Statist… 126      Investiga… 6822          Not much compa…
## # ℹ 192,150 more rows
## # ℹ 9 more variables: discussion_creator <dbl>, discussion_poster <dbl>,
## #   discussion_reference <chr>, parent_id <dbl>, post_date <chr>,
## #   post_id <chr>, post_title <chr>, word <chr>, stem <chr>
stemmed_dtm <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(post_id, stem) %>%
  cast_dtm(post_id, stem, n)

stemmed_dtm
## <<DocumentTermMatrix (documents: 5766, terms: 10001)>>
## Non-/sparse entries: 136185/57529581
## Sparsity           : 100%
## Maximal term length: NA
## Weighting          : term frequency (tf)
stemmed_count <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(stem, sort = TRUE)

stemmed_count
## # A tibble: 10,001 × 2
##    stem         n
##    <chr>    <int>
##  1 student   7354
##  2 data      4365
##  3 statist   4161
##  4 question  2470
##  5 teach     1858
##  6 class     1738
##  7 school    1606
##  8 time      1457
##  9 learn     1372
## 10 font      1311
## # ℹ 9,991 more rows
## 3 topics ##
## fit an LDA model with k = 3 ##
forums_lda_3 <- LDA(forums_dtm, 
                    k = 3, 
                    control = list(seed = 588)
)

forums_stm_3 <- stm(documents = docs, 
                    data = meta,
                    vocab = vocab, 
                    prevalence = ~ course_id + forum_id,
                    K = 3,
                    max.em.its = 25,
                    verbose = FALSE)

plot(forums_stm_3, n = 5)

## visualize topics with LDAvis ##
toLDAvis(mod = forums_stm_3, docs = docs)
## Loading required namespace: servr
## tidy beta and gamma matrices ##
td_beta_3 <- tidy(forums_lda_3)

td_gamma_3 <- tidy(forums_lda_3, matrix = "gamma")

td_beta_3
## # A tibble: 40,860 × 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     1 2015      1.98e- 4
##  2     2 2015      5.59e- 4
##  3     3 2015      5.69e- 5
##  4     1 21        1.44e-40
##  5     2 21        1.32e- 4
##  6     3 21        1.29e-17
##  7     1 beginning 5.02e- 5
##  8     2 beginning 1.34e- 4
##  9     3 beginning 8.14e- 4
## 10     1 content   5.24e- 4
## # ℹ 40,850 more rows
td_gamma_3
## # A tibble: 17,298 × 3
##    document topic    gamma
##    <chr>    <int>    <dbl>
##  1 11295        1 0.00335 
##  2 12711        1 0.000413
##  3 12725        1 0.0717  
##  4 12733        1 0.00393 
##  5 12743        1 0.0146  
##  6 12744        1 0.00688 
##  7 12756        1 0.0717  
##  8 12757        1 0.00500 
##  9 12775        1 0.110   
## 10 12816        1 0.00500 
## # ℹ 17,288 more rows
top_terms_3 <- td_beta_3 %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = c(terms))
gamma_terms_3 <- td_gamma_3 %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms_3, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

gamma_terms_3 %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3, 
        col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic   | Expected topic proportion | Top 7 terms |
|:--------|--------------------------:|:------------|
| Topic 3 | 0.780 | students, data, statistics, questions, school, class, time |
| Topic 2 | 0.176 | statistics, href, li, strong, https, resources, target |
| Topic 1 | 0.044 | font, span, style, text, normal, 0px, height |
plot(forums_stm_3, n = 7)

## 30 topics ##
## fit an LDA model with k = 30 ##
forums_lda_30 <- LDA(forums_dtm, 
                     k = 30, 
                     control = list(seed = 588)
)

forums_stm_30 <- stm(documents = docs, 
                     data = meta,
                     vocab = vocab, 
                     prevalence = ~ course_id + forum_id,
                     K = 30,
                     max.em.its = 25,
                     verbose = FALSE)

plot(forums_stm_30, n = 5)

## visualize topics with LDAvis ##
toLDAvis(mod = forums_stm_30, docs = docs)

## tidy beta and gamma matrices ##
td_beta_30 <- tidy(forums_lda_30)

td_gamma_30 <- tidy(forums_lda_30, matrix = "gamma")

td_beta_30
## # A tibble: 408,600 × 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 2015  3.94e-107
##  2     2 2015  1.00e-  3
##  3     3 2015  7.14e- 11
##  4     4 2015  4.13e-111
##  5     5 2015  1.27e- 28
##  6     6 2015  1.14e- 79
##  7     7 2015  1.66e- 35
##  8     8 2015  5.79e- 27
##  9     9 2015  1.47e-  4
## 10    10 2015  9.24e-  5
## # ℹ 408,590 more rows
td_gamma_30
## # A tibble: 172,980 × 3
##    document topic    gamma
##    <chr>    <int>    <dbl>
##  1 11295        1 0.00201 
##  2 12711        1 0.000259
##  3 12725        1 0.0211  
##  4 12733        1 0.00233 
##  5 12743        1 0.00746 
##  6 12744        1 0.00392 
##  7 12756        1 0.0211  
##  8 12757        1 0.00292 
##  9 12775        1 0.00292 
## 10 12816        1 0.00292 
## # ℹ 172,970 more rows
top_terms_30 <- td_beta_30 %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = c(terms))
gamma_terms_30 <- td_gamma_30 %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms_30, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

gamma_terms_30 %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3, 
        col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic    | Expected topic proportion | Top 7 terms |
|:---------|--------------------------:|:------------|
| Topic 4  | 0.067 | statistics, math, teach, students, teaching, school, level |
| Topic 22 | 0.059 | data, students, real, sets, collect, analysis, collection |
| Topic 2  | 0.058 | resources, statistics, teaching, unit, mooc, learning, teachers |
| Topic 20 | 0.049 | students, task, data, tasks, statistical, cycle, question |
| Topic 15 | 0.047 | questions, question, students, answer, start, thinking, posing |
| Topic 8  | 0.046 | students, understanding, agree, time, gapminder, standard, calculations |
| Topic 9  | 0.045 | students, questions, assessment, test, locus, understand, understanding |
| Topic 30 | 0.044 | stats, ap, class, students, school, math, stat |
| Topic 6  | 0.041 | students, video, thinking, videos, enjoyed, skills, critical |
| Topic 7  | 0.041 | school, students, middle, sharing, teachers, statistical, tools |
| Topic 28 | 0.040 | activities, project, students, grade, lesson, class, plan |
| Topic 14 | 0.038 | agree, students, classroom, makes, sense, hands, real |
| Topic 26 | 0.037 | technology, students, software, simulations, computer, calculator, tools |
| Topic 18 | 0.036 | activity, students, experiment, engaged, coke, pepsi, class |
| Topic 5  | 0.036 | time, students, class, survey, explore, topic, student |
| Topic 1  | 0.036 | students, level, levels, size, dice, sample, trials |
| Topic 24 | 0.032 | statistics, probability, statistical, grade, science, teaching, teach |
| Topic 13 | 0.030 | students, sampling, answers, sample, correct, population, results |
| Topic 11 | 0.028 | test, hypothesis, difference, sample, testing, chance, results |
| Topic 12 | 0.024 | school, students, social, time, transportation, media, studies |
| Topic 10 | 0.023 | li, strong, href, https, target, _blank, statistics |
| Topic 19 | 0.022 | plots, data, graph, box, class, graphs, median |
| Topic 21 | 0.021 | access, excel, tuva, coasters, roller, steel, statcrunch |
| Topic 16 | 0.018 | font, normal, text, 0px, style, color, rgb |
| Topic 29 | 0.018 | div, http, href, https, target, amp, _blank |
| Topic 27 | 0.015 | online, statistics, education, href, https, mathematics, http |
| Topic 3  | 0.014 | kids, english, scores, cost, pick, agreed, stick |
| Topic 17 | 0.014 | span, style, line, height, font, quot, size |
| Topic 25 | 0.013 | td, top, width, nice, align, easy, tr |
| Topic 23 | 0.007 | uijy0, ms, gj7bbf88h, gthy0, wb9h, 9, uijndkm77bbf8apif99h |
plot(forums_stm_30, n = 7)
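
As a hedged follow-up, stm's own diagnostics could put numbers on the redundancy described above: semantic coherence tends to favor small K and exclusivity large K, so comparing both models is informative. A minimal sketch, reusing the objects fit above:

## per-topic semantic coherence and exclusivity for each solution
coh_3  <- semanticCoherence(forums_stm_3, docs)
coh_30 <- semanticCoherence(forums_stm_30, docs)
exc_3  <- exclusivity(forums_stm_3)
exc_30 <- exclusivity(forums_stm_30)

## compare the average diagnostics across the two solutions
tibble(K = c(3, 30),
       coherence   = c(mean(coh_3), mean(coh_30)),
       exclusivity = c(mean(exc_3), mean(exc_30)))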

Knit & Submit

Congratulations, you’ve completed your Intro to Text Mining badge! Complete the following steps in the orientation to submit your work for review.