The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.
To earn a badge for each lab, you must respond to a set of prompts in two parts:
In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will create a simple data product in R that demonstrates your ability to apply a data analysis technique introduced in this learning lab.
Use the institutional library (e.g., NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or topic of interest. More specifically, locate a text mining study that visualizes text data.
Provide an APA citation for your selected study.
How does topic modeling address research questions?
Draft a research question for a population you may be interested in studying, or one that would be of interest to educational researchers, that would require the collection of text data. Then answer the following questions:
What text data would need to be collected?
For what reason would text data need to be collected in order to address this question?
Explain the analytical level at which these text data would need to be collected and analyzed.
Use your case study file to try a small number of topics (e.g., 3) or a large number of topics (e.g., 30) and explain how changing the number of topics shapes the way you interpret the results.
I highly recommend creating a new R script in your lab-3 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.
With too few topics, the themes are too broad; with too many, they become too narrow.
Setting the number of topics therefore requires some caution.
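One way to approach the choice of the number of topics more cautiously is to compare held-out perplexity across a few candidate values. The sketch below is illustrative only: it uses the AssociatedPress sample that ships with topicmodels as a stand-in for the forum data, and the split sizes and candidate values are arbitrary assumptions rather than recommendations.

```r
library(topicmodels)

# Stand-in data (assumption): the AssociatedPress document-term matrix bundled
# with topicmodels. In the lab, forums_dtm would take its place.
data("AssociatedPress", package = "topicmodels")
train_dtm <- AssociatedPress[1:40, ]   # documents used to fit each model
test_dtm  <- AssociatedPress[41:60, ]  # held-out documents used for scoring

candidate_k <- c(2, 4, 8)  # arbitrary candidate numbers of topics

# Fit one LDA model per candidate k and score it on the held-out documents.
k_results <- data.frame(
  k = candidate_k,
  perplexity = sapply(candidate_k, function(k) {
    fit <- LDA(train_dtm, k = k, control = list(seed = 588))
    perplexity(fit, newdata = test_dtm)
  })
)

# Lower perplexity suggests a better fit to unseen documents.
k_results
```

Perplexity is only one guide. A value of k that scores well can still produce topics that are hard to interpret, so the top terms for each candidate model are worth reading as well.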
# YOUR FINAL CODE HERE
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(ldatuning)
library(knitr)
library(LDAvis)
# data
ts_forum_data <- read_csv(
  "data/ts_forum_data.csv",
  col_types = cols(
    course_id = col_character(),
    forum_id = col_character(),
    discussion_id = col_character(),
    post_id = col_character()
  )
)
# DTM: document term matrix
forums_tidy <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word")
forums_tidy %>%
  count(word, sort = TRUE)
## # A tibble: 13,620 × 2
##    word           n
##    <chr>      <int>
##  1 students    6841
##  2 data        4365
##  3 statistics  3103
##  4 school      1488
##  5 questions   1470
##  6 class       1426
##  7 font        1311
##  8 span        1267
##  9 time        1253
## 10 style       1150
## # ℹ 13,610 more rows
forums_dtm <- forums_tidy %>%
  count(post_id, word) %>%
  cast_dtm(post_id, word, n)
forum_quotes <- ts_forum_data %>%
  select(post_content) %>%
  filter(grepl('time', post_content))
# Stemming: reduce words to their stems (similar to truncation in a Boolean search)
stemmed_forums <- ts_forum_data %>%
  unnest_tokens(output = word, input = post_content) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))
# Reuse the stemmed tokens to build a stemmed document-term matrix
stemmed_dtm <- stemmed_forums %>%
  count(post_id, stem) %>%
  cast_dtm(post_id, stem, n)
# Latent Dirichlet Allocation
lda3 <- LDA(forums_dtm,
            k = 3, # number of topics
            control = list(seed = 588)
)
lda30 <- LDA(forums_dtm,
             k = 30, # number of topics
             control = list(seed = 588)
)
lda3; lda30
## A LDA_VEM topic model with 3 topics.
## A LDA_VEM topic model with 30 topics.
# Structural Topic Modeling (STM): preprocess the posts
temp <- textProcessor(ts_forum_data$post_content,
                      metadata = ts_forum_data, # dataframe
                      lowercase = TRUE,
                      removestopwords = TRUE,
                      removenumbers = TRUE,
                      removepunctuation = TRUE,
                      wordLengths = c(3, Inf),
                      stem = TRUE,
                      onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
stm3 <- stm(documents = docs,
            data = meta,
            vocab = vocab,
            prevalence = ~ course_id + forum_id, # covariates
            K = 3,
            max.em.its = 25,
            verbose = FALSE)
stm30 <- stm(documents = docs,
             data = meta,
             vocab = vocab,
             prevalence = ~ course_id + forum_id, # covariates
             K = 30,
             max.em.its = 25,
             verbose = FALSE)
plot.STM(stm3, n = 5)
plot.STM(stm30, n = 5)
toLDAvis(mod = stm3, docs = docs)
## Loading required namespace: servr
toLDAvis(mod = stm30, docs = docs)
terms(lda3, 5)
##      Topic 1  Topic 2      Topic 3
## [1,] "font"   "statistics" "students"
## [2,] "span"   "href"       "data"
## [3,] "style"  "li"         "statistics"
## [4,] "text"   "strong"     "questions"
## [5,] "normal" "https"      "school"
terms(lda30, 5)
##      Topic 1    Topic 2      Topic 3   Topic 4      Topic 5    Topic 6
## [1,] "students" "resources"  "kids"    "statistics" "time"     "students"
## [2,] "level"    "statistics" "english" "math"       "students" "video"
## [3,] "levels"   "teaching"   "scores"  "teach"      "class"    "thinking"
## [4,] "size"     "unit"       "cost"    "students"   "survey"   "videos"
## [5,] "dice"     "mooc"       "pick"    "teaching"   "explore"  "enjoyed"
##      Topic 7    Topic 8         Topic 9      Topic 10 Topic 11
## [1,] "school"   "students"      "students"   "li"     "test"
## [2,] "students" "understanding" "questions"  "strong" "hypothesis"
## [3,] "middle"   "agree"         "assessment" "href"   "difference"
## [4,] "sharing"  "time"          "test"       "https"  "sample"
## [5,] "teachers" "gapminder"     "locus"      "target" "testing"
##      Topic 12         Topic 13   Topic 14    Topic 15    Topic 16 Topic 17
## [1,] "school"         "students" "agree"     "questions" "font"   "span"
## [2,] "students"       "sampling" "students"  "question"  "normal" "style"
## [3,] "social"         "answers"  "classroom" "students"  "text"   "line"
## [4,] "time"           "sample"   "makes"     "answer"    "0px"    "height"
## [5,] "transportation" "correct"  "sense"     "start"     "style"  "font"
##      Topic 18     Topic 19 Topic 20      Topic 21   Topic 22   Topic 23
## [1,] "activity"   "plots"  "students"    "access"   "data"     "uijy0"
## [2,] "students"   "data"   "task"        "excel"    "students" "ms"
## [3,] "experiment" "graph"  "data"        "tuva"     "real"     "gj7bbf88h"
## [4,] "engaged"    "box"    "tasks"       "coasters" "sets"     "gthy0"
## [5,] "coke"       "class"  "statistical" "roller"   "collect"  "wb9h"
##      Topic 24      Topic 25 Topic 26      Topic 27     Topic 28     Topic 29
## [1,] "statistics"  "td"     "technology"  "online"     "activities" "div"
## [2,] "probability" "top"    "students"    "statistics" "project"    "http"
## [3,] "statistical" "width"  "software"    "education"  "students"   "href"
## [4,] "grade"       "nice"   "simulations" "href"       "grade"      "https"
## [5,] "science"     "align"  "computer"    "https"      "lesson"     "target"
##      Topic 30
## [1,] "stats"
## [2,] "ap"
## [3,] "class"
## [4,] "students"
## [5,] "school"
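Several of the top LDA terms above (font, span, style, href, li, td, div) look like HTML markup rather than forum language, which suggests the posts were tokenized with their tags still in place. The stm preprocessing already sets striphtml = TRUE, but the tidytext pipeline has no equivalent step. A minimal base R sketch of stripping tags before tokenizing follows. Note that `strip_html()` is a hypothetical helper written for illustration, the regular expressions are a rough approximation rather than a full HTML parser, and the two example posts are invented, not drawn from the data.

```r
# Hypothetical helper: remove HTML tags and character entities from text.
strip_html <- function(text) {
  text <- gsub("<[^>]+>", " ", text)          # replace tags such as <span ...> with a space
  text <- gsub("&[A-Za-z0-9#]+;", " ", text)  # replace entities such as &amp; with a space
  text <- gsub("\\s+", " ", text)             # collapse runs of whitespace
  trimws(text)
}

# Invented example posts (assumption) that mimic the markup seen in the terms.
posts <- c(
  '<span style="font-size: 12px;">Students explored data.</span>',
  'See <a href="https://example.org" target="_blank">this resource</a> &amp; share.'
)

cleaned <- strip_html(posts)
cleaned

# In the lab pipeline, the helper could be applied before unnest_tokens(), e.g.
# ts_forum_data %>% mutate(post_content = strip_html(post_content))
```

Refitting the LDA models on cleaned text would likely free up the topics that are currently taken over by markup terms, though this has not been verified on the forum data.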
Congratulations, you’ve completed your Intro to text mining Badge! Complete the following steps in the orientation to submit your work for review.