The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.
To earn a badge for each lab, you are required to respond to a set of prompts in two parts:
In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will create a simple data product in R that demonstrates your ability to apply a data analysis technique introduced in this learning lab.
Use your institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or a topic of interest. More specifically, locate a text mining study that visualizes text data.
Provide an APA citation for your selected study.
How does topic modeling address research questions?
Draft a research question about a population you may be interested in studying, or one that would be of interest to educational researchers, that would require the collection of text data. Then answer the following questions:
What text data would need to be collected?
For what reason would text data need to be collected in order to address this question?
Explain the analytical level at which these text data would need to be collected and analyzed.
Use your case study file to try a small number of topics (e.g., 3) or a large number of topics (e.g., 30), and explain how changing the number of topics shapes the way you interpret the results.
I highly recommend creating a new R script in your lab-3 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.
# load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(ldatuning)
library(knitr)
library(LDAvis)
library(stringr) # used below to remove HTML text (stringr is already attached via the tidyverse)
# import discussion post data
ts_forum_data <- read_csv("data/ts_forum_data.csv",
                          col_types = cols(course_id = col_character(),
                                           forum_id = col_character(),
                                           discussion_id = col_character(),
                                           post_id = col_character()))
# remove HTML tags from the post_content column
ts_forum_data$post_content <- str_remove_all(ts_forum_data$post_content, "<.*?>")
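To check that this worked, I can count how many posts still contain anything tag-like. A quick heuristic sketch (the pattern below is just one rough way to spot leftover tags):
# quick sanity check: count posts that still contain something tag-like
sum(str_detect(ts_forum_data$post_content, "<[a-z!/]"), na.rm = TRUE)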
# tokenize (into words), remove stop words and custom stop words
forums_tidy <- ts_forum_data %>%
unnest_tokens(output = word, input = post_content) %>%
anti_join(stop_words, by = "word") %>%
filter(!word == "li") %>%
filter(!word == "href") %>%
filter(!word == "_blank") %>%
filter(!word == "https") %>%
filter(!word == "0px") %>%
filter(!word == "http")
# create document-term matrix from tidy forum data, then print a summary of the matrix
forums_dtm <- forums_tidy %>%
count(post_id, word) %>%
cast_dtm(post_id, word, n)
forums_dtm
## <<DocumentTermMatrix (documents: 5765, terms: 12972)>>
## Non-/sparse entries: 133717/74649863
## Sparsity : 100%
## Maximal term length: 320
## Weighting : term frequency (tf)
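Since ldatuning is loaded above but not otherwise used here, one option is to let it compare candidate numbers of topics on this DTM before fitting a model. A minimal sketch (the metric choices and the 2–20 range are my assumptions, and this can take a while to run):
# compare several candidate values of k using ldatuning metrics
k_metrics <- FindTopicsNumber(forums_dtm,
                              topics = seq(2, 20, by = 2),
                              metrics = c("Griffiths2004", "CaoJuan2009"),
                              method = "Gibbs",
                              control = list(seed = 588))
FindTopicsNumber_plot(k_metrics)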
# preprocess text (lowercase; remove punctuation, numbers, stop words, and custom stop words)
# NOT stemming here: stemming didn't seem to improve the results and made topics harder to interpret
temp <- textProcessor(ts_forum_data$post_content,
                      metadata = ts_forum_data,
                      lowercase = TRUE,
                      removestopwords = TRUE,
                      removenumbers = TRUE,
                      removepunctuation = TRUE,
                      wordLengths = c(3, Inf),
                      stem = FALSE,
                      onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = c("li", "href", "_blank", "https", "http", "0px"))
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Remove Custom Stopwords...
## Removing numbers...
## Creating Output...
# extract the inputs needed for the stm() function from temp
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
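As an aside, stm can also propose a number of topics on its own: setting K = 0 with spectral initialization applies the Lee and Mimno (2014) algorithm. A sketch (and my guess is this kind of approach is how the case study arrived at its estimate of 14, though that's an assumption):
# optional: let stm suggest a number of topics (K = 0 requires spectral initialization)
forums_stm_auto <- stm(documents = docs,
                       vocab = vocab,
                       data = meta,
                       K = 0,
                       init.type = "Spectral",
                       verbose = FALSE)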
# Latent Dirichlet allocation with 6 topics (probably too small a number for this dataset; the case study estimated 14)
forums_lda <- LDA(forums_dtm,
k = 6,
control = list(seed = 588)
)
# LDA: show the top 5 terms for each topic
terms(forums_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "students" "students" "statistics" "students" "statistics" "students"
## [2,] "data" "statistics" "unit" "data" "math" "data"
## [3,] "question" "questions" "resources" "school" "school" "task"
## [4,] "questions" "stats" "teaching" "class" "teach" "sample"
## [5,] "task" "ap" "learning" "real" "grade" "activity"
# tidy the LDA model into a new data frame that includes the beta values (per-topic word probabilities)
tidy_lda <- tidy(forums_lda)
# LDA: plot the top 5 terms in each topic
top_terms <- tidy_lda %>%
group_by(topic) %>%
slice_max(beta, n = 5, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
# beta and gamma values for the LDA approach: gamma captures the prevalence of each topic across the full data set, and beta captures which words contribute to each topic
td_beta <- tidy(forums_lda)
td_gamma <- tidy(forums_lda, matrix = "gamma")
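To make the gamma values concrete, I can pull out each post's single most prevalent topic. A minimal sketch using the objects just created:
# gamma: identify the most prevalent topic for each post
td_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()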
# prevalence table for the LDA approach
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>%
  unnest(cols = c(terms))
gamma_terms <- td_gamma %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms, by = "topic") %>%
mutate(topic = paste0("Topic ", topic),
topic = reorder(topic, gamma))
gamma_terms %>%
select(topic, gamma, terms) %>%
kable(digits = 3,
col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic 2 | 0.245 | students, statistics, questions, stats, ap, class, teach |
| Topic 4 | 0.212 | students, data, school, class, real, time, agree |
| Topic 1 | 0.212 | students, data, question, questions, task, statistical, results |
| Topic 3 | 0.117 | statistics, unit, resources, teaching, learning, mooc, data |
| Topic 5 | 0.110 | statistics, math, school, teach, grade, teachers, teaching |
| Topic 6 | 0.105 | students, data, task, sample, activity, sampling, coke |
# Structural topic model with 6 topics (probably too small a number for this dataset; the case study estimated 14)
forums_stm <- stm(documents = docs,
                  data = meta,
                  vocab = vocab,
                  prevalence = ~ course_id + forum_id,
                  K = 6,
                  max.em.its = 25,
                  verbose = FALSE)
# STM: create a plot for each topic with the top 5 terms
plot.STM(forums_stm, n=5)
# STM: create a plot for each topic with the top 7 terms
plot.STM(forums_stm, n=7)
# STM: interactive visualization with the LDAvis explorer
toLDAvis(mod = forums_stm, docs = docs)
## Loading required namespace: servr
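Before interpreting topics from their top terms alone, it also helps to read exemplar posts; stm's findThoughts() returns the documents with the highest proportion of a given topic. A sketch for topic 1 (using meta rather than the raw data so the texts line up with the documents textProcessor kept):
# read the 3 posts most strongly associated with topic 1 to aid interpretation
findThoughts(forums_stm, texts = meta$post_content, topics = 1, n = 3)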
Explanation of results: I tested both the LDA method and the STM method with k = 6, and the two methods produced different results. With the LDA method it was difficult to tell the categories apart, since they contained significant overlap between terms (“students” or “statistics” is the top term in every category). The STM method produced slightly fewer repeated terms because of the prevalence covariates that method can draw on (here, the course and forum IDs), so I will use the results of the STM method for interpretation.
When using 6 topics instead of 20 as in our case study, some topics became more clearly separated (topics 2, 3, 5, and 6 from the STM method), although there was still some overlap between terms in topics 1 and 4. This is easiest to see in the LDAvis explorer we created. Topic 1 pertains to the teachers’ learning activities from the course and how they perceived them. Topic 2 pertains to prior knowledge about the students they teach. Topic 3 pertains to technology, including teachers’ own experiences and their attitudes toward using technology with students. Topic 4 has some overlap with Topic 2, but relates more closely to statistics learning activities for students. Topic 5 is closely related to Topic 2; it includes insight into students, but also reflections on school, curriculum, and policy and how those relate to students and teaching. Topic 6 relates to statistics skills, with comments about both the teachers’ own skills and students’ skills. There is still a lot of conceptual overlap between these categories, which indicates that my choice of k = 6 topics is likely too small. A better number of topics would be somewhere between k = 6 and the k = 20 from the case study.
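One way to follow up on that hunch is stm's searchK(), which compares held-out likelihood, semantic coherence, and other diagnostics across candidate values of K. A sketch (the candidate values are my assumption, and this is slow to run):
# compare model diagnostics for candidate numbers of topics between 6 and 20
k_search <- searchK(documents = docs,
                    vocab = vocab,
                    K = c(6, 10, 14, 20),
                    data = meta,
                    verbose = FALSE)
plot(k_search)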
Congratulations, you’ve completed your Intro to Text Mining Badge! Complete the following steps in the orientation to submit your work for review.