The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.
To earn a badge for each lab, you are required to respond to a set of prompts in two parts:
In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will create a simple data product in R that demonstrates your ability to apply a data analysis technique introduced in this learning lab.
Use your institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or a topic of interest. More specifically, locate a text mining study that visualizes text data.
Liu, M., Jiang, X., Zhang, B., Song, T., Yu, G., Liu, G., … & Zhou, Z. (2023). How do topics and emotions develop in elementary school children? A text mining perspective based on free-writing text over 6 years. Frontiers in Psychology, 14, 1109126.
Four main findings are summarized based on topic modeling. The results show the following: (1) children prefer to focus on the topics of school and family in elementary school; (2) as grade level increases, the proportion of family topics continues to decline, while that of social culture topics keeps rising; (3) when describing school, family, social culture, and interest, children mostly express negative emotions, while when describing peers and ability they mostly express positive emotions; (4) as grade level increases, emotional expression on social culture topics becomes more negative, while that on ability and interest topics becomes more positive, and there are more differences in emotional expression between topics in the junior and senior elementary grades.
Draft a research question for a population you may be interested in studying, or one that would be of interest to educational researchers, that would require the collection of text data, and answer the following questions:
I plan to use abstracts from peer-reviewed journal articles as my text data in a systematic literature review study.
My goal is a comprehensive examination of what research has been done using the target scale (i.e., the Pediatric Symptom Checklist).
I will collect the abstracts from online research databases and use topic modeling to identify the underlying themes.
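As a rough illustration of this plan, the sketch below prepares such abstracts for topic modeling with the same tidytext workflow used later in this lab. The file abstracts.csv and its columns article_id and abstract are hypothetical placeholders, not real data.
library(tidyverse)
library(tidytext)
library(topicmodels)
abstracts <- read_csv("data/abstracts.csv")   # hypothetical file: one row per article
abstracts_dtm <- abstracts %>%
  unnest_tokens(output = word, input = abstract) %>%   # tokenize abstract text
  anti_join(stop_words, by = "word") %>%               # drop common stop words
  count(article_id, word) %>%
  cast_dtm(article_id, word, n)                        # cast to a document-term matrix
abstracts_lda <- LDA(abstracts_dtm, k = 5, control = list(seed = 588))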
Use your case study file to try a small number of topics (e.g., 3) or a large number of topics (e.g., 30) and explain how changing the number of topics shapes the way you interpret the results.
I highly recommend creating a new R script in your lab-3 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.
I think it is relatively easier to interpret the results with a small number of topics. There are no overlaps among topics based on the output from LDAvis. Meanwhile, it is obvious that some key themes are not covered. I also noticed that the model returns results faster with a small number of topics. There is a dominant theme (Topic 1). I think Topic 1 is similar to the leading topic about teaching statistics in the 20-topic solution. This topic is also evident in the 30-topic solution. There is a lot of redundancy when we fit the model with 30 topics, as indicated by the LDAvis output. In addition, based on the gamma output, it is hard to identify a dominant theme from the expected topic proportions.
# YOUR FINAL CODE HERE
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(ldatuning)
library(knitr)
library(LDAvis)
##READ IN DATA##
ts_forum_data <- read_csv("data/ts_forum_data.csv",
col_types = cols(course_id = col_character(),
forum_id = col_character(),
discussion_id = col_character(),
post_id = col_character()
)
)
##Tokenize forums##
forums_tidy <- ts_forum_data %>%
unnest_tokens(output = word, input = post_content) %>%
anti_join(stop_words, by = "word")
##sort words by count##
forums_tidy %>%
count(word, sort = TRUE)
## # A tibble: 13,620 × 2
## word n
## <chr> <int>
## 1 students 6841
## 2 data 4365
## 3 statistics 3103
## 4 school 1488
## 5 questions 1470
## 6 class 1426
## 7 font 1311
## 8 span 1267
## 9 time 1253
## 10 style 1150
## # ℹ 13,610 more rows
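##NOTE: several of the top "words" above (font, span, style) are leftover##
##HTML/CSS tokens rather than real vocabulary. A possible fix (my own##
##suggestion, not part of the original lab code) is a custom stop word list:##
html_stops <- tibble(word = c("font", "span", "style", "href", "li",
                              "strong", "div", "http", "https", "0px"))
forums_tidy_clean <- forums_tidy %>%
  anti_join(html_stops, by = "word")   # remove the formatting tokens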
##filter responses to get a general idea of the themes##
forum_quotes <- ts_forum_data %>%
select(post_content) %>%
filter(grepl('agree', post_content))
sample_n(forum_quotes, 10)
## # A tibble: 10 × 1
## post_content
## <chr>
## 1 My eight year old daughter was sitting beside me and asked what I was workin…
## 2 I agree Rachel The graphics and data exercises are much more accessible fo…
## 3 I like what you said about video two how the students completed all three l…
## 4 I agree I can't wait to use the data sets with my students. I loved the vid…
## 5 I agree that it is difficult to get students to pose questions. I teach onli…
## 6 I liked the Harry Potter one the best. I didn't agree with all the choices i…
## 7 I agree that some questions were not good questions and a bit confusing. Hav…
## 8 I found that 8 question quiz very difficult and quite humbling. I have becom…
## 9 I agree with you it was nice to see the different levels and methods instead…
## 10 I agree. It is important also that students are taught how to think critica…
##cast word counts to a document-term matrix, assigned to forums_dtm##
forums_dtm <- forums_tidy %>%
count(post_id, word) %>%
cast_dtm(post_id, word, n)
class(forums_dtm)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
##process text for structural topic modeling##
temp <- textProcessor(ts_forum_data$post_content,
                      metadata = ts_forum_data,
                      lowercase = TRUE,
                      removestopwords = TRUE,
                      removenumbers = TRUE,
                      removepunctuation = TRUE,
                      wordLengths = c(3, Inf),
                      stem = TRUE,
                      onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
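##a common next step in the stm workflow (not used in my code above) is##
##prepDocuments(), which drops infrequent terms and keeps the documents,##
##vocabulary, and metadata aligned; lower.thresh = 5 is an illustrative choice##
prepped <- prepDocuments(docs, vocab, meta, lower.thresh = 5)
##prepped$documents, prepped$vocab, and prepped$meta could then be passed to##
##stm() below in place of docs, vocab, and meta (results would differ slightly)##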
##stem words##
stemmed_forums <- ts_forum_data %>%
unnest_tokens(output = word, input = post_content) %>%
anti_join(stop_words, by = "word") %>%
mutate(stem = wordStem(word))
stemmed_forums
## # A tibble: 192,160 × 15
## course_id course_name forum_id forum_name discussion_id discussion_name
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 2 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 3 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 4 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 5 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 6 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 7 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 8 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 9 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## 10 9 Teaching Statist… 126 Investiga… 6822 Not much compa…
## # ℹ 192,150 more rows
## # ℹ 9 more variables: discussion_creator <dbl>, discussion_poster <dbl>,
## # discussion_reference <chr>, parent_id <dbl>, post_date <chr>,
## # post_id <chr>, post_title <chr>, word <chr>, stem <chr>
stemmed_dtm <- ts_forum_data %>%
unnest_tokens(output = word, input = post_content) %>%
anti_join(stop_words, by = "word") %>%
mutate(stem = wordStem(word)) %>%
count(post_id, stem) %>%
cast_dtm(post_id, stem, n)
stemmed_dtm
## <<DocumentTermMatrix (documents: 5766, terms: 10001)>>
## Non-/sparse entries: 136185/57529581
## Sparsity : 100%
## Maximal term length: NA
## Weighting : term frequency (tf)
stemmed_count <- ts_forum_data %>%
unnest_tokens(output = word, input = post_content) %>%
anti_join(stop_words, by = "word") %>%
mutate(stem = wordStem(word)) %>%
count(stem, sort = TRUE)
stemmed_count
## # A tibble: 10,001 × 2
## stem n
## <chr> <int>
## 1 student 7354
## 2 data 4365
## 3 statist 4161
## 4 question 2470
## 5 teach 1858
## 6 class 1738
## 7 school 1606
## 8 time 1457
## 9 learn 1372
## 10 font 1311
## # ℹ 9,991 more rows
##3-topic model##
##set k to 3 in the LDA and stm calls##
forums_lda_3 <- LDA(forums_dtm,
k = 3,
control = list(seed = 588)
)
forums_stm_3 <- stm(documents = docs,
                    data = meta,
                    vocab = vocab,
                    prevalence = ~ course_id + forum_id,
                    K = 3,
                    max.em.its = 25,
                    verbose = FALSE)
plot.STM(forums_stm_3, n = 5)
## Visualize topics##
toLDAvis(mod = forums_stm_3, docs = docs)
## Loading required namespace: servr
##tidy beta (word-topic probabilities) and gamma (document-topic proportions)##
td_beta_3 <- tidy(forums_lda_3)
td_gamma_3 <- tidy(forums_lda_3, matrix = "gamma")
td_beta_3
## # A tibble: 40,860 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 2015 1.98e- 4
## 2 2 2015 5.59e- 4
## 3 3 2015 5.69e- 5
## 4 1 21 1.44e-40
## 5 2 21 1.32e- 4
## 6 3 21 1.29e-17
## 7 1 beginning 5.02e- 5
## 8 2 beginning 1.34e- 4
## 9 3 beginning 8.14e- 4
## 10 1 content 5.24e- 4
## # ℹ 40,850 more rows
td_gamma_3
## # A tibble: 17,298 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 11295 1 0.00335
## 2 12711 1 0.000413
## 3 12725 1 0.0717
## 4 12733 1 0.00393
## 5 12743 1 0.0146
## 6 12744 1 0.00688
## 7 12756 1 0.0717
## 8 12757 1 0.00500
## 9 12775 1 0.110
## 10 12816 1 0.00500
## # ℹ 17,288 more rows
top_terms_3 <- td_beta_3 %>%
arrange(beta) %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(-beta) %>%
select(topic, term) %>%
summarise(terms = list(term)) %>%
mutate(terms = map(terms, paste, collapse = ", ")) %>%
unnest(cols = c(terms))
gamma_terms_3 <- td_gamma_3 %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms_3, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms_3 %>%
select(topic, gamma, terms) %>%
kable(digits = 3,
col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 3 | 0.780 | students, data, statistics, questions, school, class, time |
| Topic 2 | 0.176 | statistics, href, li, strong, https, resources, target |
| Topic 1 | 0.044 | font, span, style, text, normal, 0px, height |
plot(forums_stm_3, n = 7)
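##since this lab is about visualizing text data, the expected topic##
##proportions in the table above can also be plotted directly; a minimal##
##sketch with ggplot2, using the gamma_terms_3 table built above##
gamma_terms_3 %>%
  ggplot(aes(x = topic, y = gamma, label = terms)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0, nudge_y = 0.01, size = 3) +   # print top terms beside each bar
  coord_flip() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(x = NULL, y = "Expected topic proportion")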
##30-topic model##
##set k to 30 in the LDA and stm calls##
forums_lda_30 <- LDA(forums_dtm,
k = 30,
control = list(seed = 588)
)
forums_stm_30 <- stm(documents = docs,
                     data = meta,
                     vocab = vocab,
                     prevalence = ~ course_id + forum_id,
                     K = 30,
                     max.em.its = 25,
                     verbose = FALSE)
plot.STM(forums_stm_30, n = 5)
## Visualize topics##
toLDAvis(mod = forums_stm_30, docs = docs)
##tidy beta (word-topic probabilities) and gamma (document-topic proportions)##
td_beta_30 <- tidy(forums_lda_30)
td_gamma_30 <- tidy(forums_lda_30, matrix = "gamma")
td_beta_30
## # A tibble: 408,600 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 2015 3.94e-107
## 2 2 2015 1.00e- 3
## 3 3 2015 7.14e- 11
## 4 4 2015 4.13e-111
## 5 5 2015 1.27e- 28
## 6 6 2015 1.14e- 79
## 7 7 2015 1.66e- 35
## 8 8 2015 5.79e- 27
## 9 9 2015 1.47e- 4
## 10 10 2015 9.24e- 5
## # ℹ 408,590 more rows
td_gamma_30
## # A tibble: 172,980 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 11295 1 0.00201
## 2 12711 1 0.000259
## 3 12725 1 0.0211
## 4 12733 1 0.00233
## 5 12743 1 0.00746
## 6 12744 1 0.00392
## 7 12756 1 0.0211
## 8 12757 1 0.00292
## 9 12775 1 0.00292
## 10 12816 1 0.00292
## # ℹ 172,970 more rows
top_terms_30 <- td_beta_30 %>%
arrange(beta) %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(-beta) %>%
select(topic, term) %>%
summarise(terms = list(term)) %>%
mutate(terms = map(terms, paste, collapse = ", ")) %>%
unnest(cols = c(terms))
gamma_terms_30 <- td_gamma_30 %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms_30, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms_30 %>%
select(topic, gamma, terms) %>%
kable(digits = 3,
col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 4 | 0.067 | statistics, math, teach, students, teaching, school, level |
| Topic 22 | 0.059 | data, students, real, sets, collect, analysis, collection |
| Topic 2 | 0.058 | resources, statistics, teaching, unit, mooc, learning, teachers |
| Topic 20 | 0.049 | students, task, data, tasks, statistical, cycle, question |
| Topic 15 | 0.047 | questions, question, students, answer, start, thinking, posing |
| Topic 8 | 0.046 | students, understanding, agree, time, gapminder, standard, calculations |
| Topic 9 | 0.045 | students, questions, assessment, test, locus, understand, understanding |
| Topic 30 | 0.044 | stats, ap, class, students, school, math, stat |
| Topic 6 | 0.041 | students, video, thinking, videos, enjoyed, skills, critical |
| Topic 7 | 0.041 | school, students, middle, sharing, teachers, statistical, tools |
| Topic 28 | 0.040 | activities, project, students, grade, lesson, class, plan |
| Topic 14 | 0.038 | agree, students, classroom, makes, sense, hands, real |
| Topic 26 | 0.037 | technology, students, software, simulations, computer, calculator, tools |
| Topic 18 | 0.036 | activity, students, experiment, engaged, coke, pepsi, class |
| Topic 5 | 0.036 | time, students, class, survey, explore, topic, student |
| Topic 1 | 0.036 | students, level, levels, size, dice, sample, trials |
| Topic 24 | 0.032 | statistics, probability, statistical, grade, science, teaching, teach |
| Topic 13 | 0.030 | students, sampling, answers, sample, correct, population, results |
| Topic 11 | 0.028 | test, hypothesis, difference, sample, testing, chance, results |
| Topic 12 | 0.024 | school, students, social, time, transportation, media, studies |
| Topic 10 | 0.023 | li, strong, href, https, target, _blank, statistics |
| Topic 19 | 0.022 | plots, data, graph, box, class, graphs, median |
| Topic 21 | 0.021 | access, excel, tuva, coasters, roller, steel, statcrunch |
| Topic 16 | 0.018 | font, normal, text, 0px, style, color, rgb |
| Topic 29 | 0.018 | div, http, href, https, target, amp, _blank |
| Topic 27 | 0.015 | online, statistics, education, href, https, mathematics, http |
| Topic 3 | 0.014 | kids, english, scores, cost, pick, agreed, stick |
| Topic 17 | 0.014 | span, style, line, height, font, quot, size |
| Topic 25 | 0.013 | td, top, width, nice, align, easy, tr |
| Topic 23 | 0.007 | uijy0, ms, gj7bbf88h, gthy0, wb9h, 9, uijndkm77bbf8apif99h |
plot(forums_stm_30, n = 7)
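The ldatuning package loaded at the top is never actually used above. As a closing sketch (the metric choices and the range of k values are illustrative assumptions, not part of the original lab), FindTopicsNumber() can help adjudicate between the 3-topic and 30-topic solutions by scoring a range of k values:
k_metrics <- FindTopicsNumber(
  forums_dtm,
  topics = seq(from = 2, to = 30, by = 2),
  metrics = c("CaoJuan2009", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 588),
  verbose = TRUE
)
##look for the k where CaoJuan2009 is minimized and Deveaud2014 is maximized##
FindTopicsNumber_plot(k_metrics)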
Congratulations, you’ve completed your Intro to Text Mining badge! Complete the following steps in the orientation to submit your work for review.