The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.
To earn a badge for each lab, you are required to respond to a set of prompts in two parts:
In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will create a simple data product in R that demonstrates your ability to apply a data analysis technique introduced in this learning lab.
Use your institutional library (e.g., the NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or a topic of interest. More specifically, locate a text mining study that visualizes text data.
Provide an APA citation for your selected study.
How can topic modeling be used to address research questions?
Draft a research question for a population you may be interested in studying, or that would be of interest to educational researchers, that would require the collection of text data. Then answer the following questions:
What text data would need to be collected?
Why would text data need to be collected in order to address this question?
Explain the analytical level at which these text data would need to be collected and analyzed.
Use your case study file to try a small number of topics (e.g., 3) or a large number of topics (e.g., 30), and explain how changing the number of topics shapes the way you interpret the results (see the sketch after the topic model output below).
I highly recommend creating a new R script in your lab-3 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.
library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(readxl)
# Read the chat transcript data, tokenize it, and remove stop words
chat <- read_excel("~/Desktop/R/Research project/project_chat/chat.xlsx")

chat_tidy <- chat %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word")
chat_tidy
## # A tibble: 314 × 2
## responses word
## <chr> <chr>
## 1 id1 chat
## 2 id1 function
## 3 id2 attend
## 4 id2 library
## 5 id2 librarians
## 6 id2 online
## 7 id3 time
## 8 id3 hard
## 9 id3 time
## 10 id3 finding
## # … with 304 more rows
chat_tidy %>%
  count(word, sort = TRUE)
## # A tibble: 157 × 2
## word n
## <chr> <int>
## 1 library 19
## 2 chat 17
## 3 online 15
## 4 helpful 14
## 5 librarians 11
## 6 librarian 10
## 7 research 10
## 8 uic 8
## 9 articles 6
## 10 messaging 5
## # … with 147 more rows
# Cast the word counts per response into a document-term matrix
chat_dtm <- chat_tidy %>%
  count(responses, word) %>%
  cast_dtm(responses, word, n)
chat_dtm
## <<DocumentTermMatrix (documents: 33, terms: 157)>>
## Non-/sparse entries: 295/4886
## Sparsity : 94%
## Maximal term length: 13
## Weighting : term frequency (tf)
# Prepare the raw text for structural topic modeling with textProcessor()
chattemp <- textProcessor(chat$text,
                          metadata = chat,
                          lowercase = TRUE,
                          removestopwords = TRUE,
                          removenumbers = TRUE,
                          removepunctuation = TRUE,
                          wordLengths = c(3, Inf),
                          stem = TRUE,
                          onlycharacter = FALSE,
                          striphtml = TRUE,
                          customstopwords = NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
# Tokenize again and add a stemmed version of each word with wordStem()
stemmed_chat <- chat %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))
stemmed_chat
## # A tibble: 314 × 3
## responses word stem
## <chr> <chr> <chr>
## 1 id1 chat chat
## 2 id1 function function
## 3 id2 attend attend
## 4 id2 library librari
## 5 id2 librarians librarian
## 6 id2 online onlin
## 7 id3 time time
## 8 id3 hard hard
## 9 id3 time time
## 10 id3 finding find
## # … with 304 more rows
# Re-tokenize the stems and drop any stems that still match stop words
stemmed_dtm <- stemmed_chat %>%
  unnest_tokens(output = word, input = stem) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))
stemmed_dtm
## # A tibble: 289 × 3
## responses word stem
## <chr> <chr> <chr>
## 1 id1 chat chat
## 2 id1 function function
## 3 id2 attend attend
## 4 id2 librari librari
## 5 id2 librarian librarian
## 6 id2 onlin onlin
## 7 id3 time time
## 8 id3 hard hard
## 9 id3 time time
## 10 id3 articl articl
## # … with 279 more rows
# Count how often each stem occurs across all responses
stem_counts <- stemmed_chat %>%
  unnest_tokens(output = word, input = word) %>%
  anti_join(stop_words, by = "word") %>%
  count(stem, sort = TRUE)
stem_counts
## # A tibble: 140 × 2
## stem n
## <chr> <int>
## 1 librarian 21
## 2 help 19
## 3 librari 19
## 4 chat 18
## 5 onlin 15
## 6 articl 10
## 7 research 10
## 8 uic 8
## 9 resourc 7
## 10 assist 6
## # … with 130 more rows
# Number of distinct responses in the corpus
n_distinct(chat$text)
## [1] 33
# Fit a three-topic LDA model to the document-term matrix
chat_lda <- LDA(chat_dtm,
                k = 3,
                control = list(seed = 588))
chat_lda
## A LDA_VEM topic model with 3 topics.
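To make sense of the fitted model, it helps to see which words define each topic. Below is a minimal sketch, assuming the chat_lda object above, that uses tidytext's tidy() to pull the per-topic word probabilities (beta) and keep the five highest-probability terms for each topic.

# Extract per-topic word probabilities (beta) from the fitted LDA model
chat_topics <- tidy(chat_lda, matrix = "beta")

# Keep the five highest-probability terms for each topic
chat_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup() %>%
  arrange(topic, -beta)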
# Pull the processed documents, metadata, and vocabulary out of textProcessor()
docs <- chattemp$documents
meta <- chattemp$meta
vocab <- chattemp$vocab

# Fit a three-topic structural topic model
chat_stm <- stm(documents = docs,
                data = meta,
                vocab = vocab,
                K = 3,
                max.em.its = 25,
                verbose = FALSE)
chat_stm
## A topic model with 3 topics, 33 documents and a 191 word dictionary.
# Plot the topics with the top five terms for each
plot.STM(chat_stm, n = 5)
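For the prompt about trying a different number of topics, one option is to refit the LDA model with a larger k and compare candidate values using the ldatuning package loaded above. This is a minimal sketch, assuming the chat_dtm object created earlier; the range of topic numbers and the metrics shown are only illustrative.

# Refit the LDA model with a larger number of topics for comparison
chat_lda_30 <- LDA(chat_dtm,
                   k = 30,
                   control = list(seed = 588))

# Compare model fit metrics across a range of candidate topic numbers
k_metrics <- FindTopicsNumber(chat_dtm,
                              topics = seq(2, 30, by = 2),
                              metrics = c("Griffiths2004", "CaoJuan2009"),
                              method = "Gibbs",
                              control = list(seed = 588),
                              verbose = TRUE)

FindTopicsNumber_plot(k_metrics)

Examining how the same responses group together under 3 versus 30 topics is one way to answer the interpretation question above.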
Congratulations, you’ve completed your Intro to Text Mining Badge! Complete the following steps to submit your work for review:
Change the name of the author: in the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
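For reference, a minimal YAML header might look like the example below; the author field is the one to change, and the remaining fields will depend on the lab template.

---
title: "Text Mining Badge"
author: "Your Name"
output: html_document
---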
Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.
Commit your changes in GitHub Desktop and push them to your online GitHub repository.
Publish your HTML page to the web using one of the following publishing methods:
Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note that you will need to quickly create an RPubs account.
Publish on GitHub using either GitHub Pages or the HTML previewer.
Post a new discussion to our Text Mining Badges forum on GitHub. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.