The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts in two parts:

Part I: Reflect and Plan

Use your institutional library (e.g., NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or a topic of interest. More specifically, locate a text mining study that visualizes text data.

  1. Provide an APA citation for your selected study.

    • Chen, X., & Wang, H. (2019). Automated chat transcript analysis using topic modeling for library reference services. Proceedings of the Association for Information Science and Technology, 56(1), 368–371. https://doi.org/10.1002/pra2.31
  2. How did topic modeling address the study's research questions?

    • To identify the major topics that occurred in library reference Q&As, using chat transcripts from the past five years.

Draft a research question for a population you may be interested in studying, or that would be of interest to educational researchers, and that would require the collection of text data. Then answer the following questions:

  1. What text data would need to be collected?

    • Chat conversations stored in an Excel file.
  2. For what reason would text data need to be collected in order to address this question?

    • Because the dataset is large, it would be difficult to interpret the chat text manually.
  3. Explain the analytical level at which these text data would need to be collected and analyzed.

    • At the level of individual chat responses, with each response treated as a document for topic modeling.

Part II: Data Product

Use your case study file to try a small number of topics (e.g., 3) or a large number of topics (e.g., 30) and explain how changing the number of topics shapes the way you interpret the results.

I highly recommend creating a new R script in your lab-3 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.

# Load packages for tidy text mining and topic modeling
library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(readxl)

# Read the chat transcripts from Excel
chat <- read_excel("~/Desktop/R/Research project/project_chat/chat.xlsx")

# Tokenize each response into words and remove common stop words
chat_tidy <- chat %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word")

chat_tidy
## # A tibble: 314 × 2
##    responses word      
##    <chr>     <chr>     
##  1 id1       chat      
##  2 id1       function  
##  3 id2       attend    
##  4 id2       library   
##  5 id2       librarians
##  6 id2       online    
##  7 id3       time      
##  8 id3       hard      
##  9 id3       time      
## 10 id3       finding   
## # … with 304 more rows
# Count the most frequent words
chat_tidy %>%
  count(word, sort = TRUE)
## # A tibble: 157 × 2
##    word           n
##    <chr>      <int>
##  1 library       19
##  2 chat          17
##  3 online        15
##  4 helpful       14
##  5 librarians    11
##  6 librarian     10
##  7 research      10
##  8 uic            8
##  9 articles       6
## 10 messaging      5
## # … with 147 more rows
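
Since this lab is concerned with visualizing text data, a quick bar chart of these counts is a natural next check. The sketch below is one way to do it with ggplot2 (loaded with the tidyverse above); showing the top 10 words is an arbitrary cutoff you can adjust.

# A sketch: bar chart of the 10 most frequent words
chat_tidy %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL, title = "Most frequent words in chat responses")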
# Cast the word counts into a document-term matrix, one document per response
chat_dtm <- chat_tidy %>%
  count(responses, word) %>%
  cast_dtm(responses, word, n)

chat_dtm
## <<DocumentTermMatrix (documents: 33, terms: 157)>>
## Non-/sparse entries: 295/4886
## Sparsity           : 94%
## Maximal term length: 13
## Weighting          : term frequency (tf)
# Preprocess the raw text for stm: lowercase, strip HTML, remove stop words,
# numbers, and punctuation, and stem the remaining words
chattemp <- textProcessor(chat$text,
                          metadata = chat,
                          lowercase = TRUE,
                          removestopwords = TRUE,
                          removenumbers = TRUE,
                          removepunctuation = TRUE,
                          wordLengths = c(3, Inf),
                          stem = TRUE,
                          onlycharacter = FALSE,
                          striphtml = TRUE,
                          customstopwords = NULL)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...
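
Note that the stm documentation also describes an optional prepDocuments() step after textProcessor(), which drops infrequent terms and realigns the documents, vocabulary, and metadata. A minimal sketch with an illustrative threshold is shown here; the analysis below uses the chattemp objects directly.

# Optional: drop terms appearing in 1 or fewer documents and realign objects
# (lower.thresh = 1 is an illustrative choice; adjust for your data)
chat_prep <- prepDocuments(chattemp$documents,
                           chattemp$vocab,
                           chattemp$meta,
                           lower.thresh = 1)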
# Tokenize and remove stop words, then add a stemmed form of each word
stemmed_chat <- chat %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))

stemmed_chat
## # A tibble: 314 × 3
##    responses word       stem     
##    <chr>     <chr>      <chr>    
##  1 id1       chat       chat     
##  2 id1       function   function 
##  3 id2       attend     attend   
##  4 id2       library    librari  
##  5 id2       librarians librarian
##  6 id2       online     onlin    
##  7 id3       time       time     
##  8 id3       hard       hard     
##  9 id3       time       time     
## 10 id3       finding    find     
## # … with 304 more rows
# Re-tokenize the stems and drop any stems that match stop words
stemmed_dtm <- stemmed_chat %>%
  unnest_tokens(output = word, input = stem) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))
  
stemmed_dtm
## # A tibble: 289 × 3
##    responses word      stem     
##    <chr>     <chr>     <chr>    
##  1 id1       chat      chat     
##  2 id1       function  function 
##  3 id2       attend    attend   
##  4 id2       librari   librari  
##  5 id2       librarian librarian
##  6 id2       onlin     onlin    
##  7 id3       time      time     
##  8 id3       hard      hard     
##  9 id3       time      time     
## 10 id3       articl    articl   
## # … with 279 more rows
# Count the most frequent stems
stem_counts <- stemmed_chat %>%
  unnest_tokens(output = word, input = word) %>%
  anti_join(stop_words, by = "word") %>%
  count(stem, sort = TRUE)

stem_counts
## # A tibble: 140 × 2
##    stem          n
##    <chr>     <int>
##  1 librarian    21
##  2 help         19
##  3 librari      19
##  4 chat         18
##  5 onlin        15
##  6 articl       10
##  7 research     10
##  8 uic           8
##  9 resourc       7
## 10 assist        6
## # … with 130 more rows
# Number of distinct chat responses (each response is a document)
n_distinct(chat$text)
## [1] 33
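Before settling on k = 3 below, the ldatuning package loaded above can help compare candidate numbers of topics. The sketch that follows is one possible scan, not part of the final model; the candidate range and metrics are illustrative choices.

# A sketch: score candidate values of k with two ldatuning metrics
k_search <- FindTopicsNumber(chat_dtm,
                             topics = seq(2, 30, by = 2),
                             metrics = c("CaoJuan2009", "Deveaud2014"),
                             method = "Gibbs",
                             control = list(seed = 588),
                             verbose = FALSE)

# Look for where CaoJuan2009 is minimized and Deveaud2014 is maximized
FindTopicsNumber_plot(k_search)
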
# Fit an LDA topic model with k = 3 topics; set a seed for reproducibility
chat_lda <- LDA(chat_dtm,
                k = 3,
                control = list(seed = 588))

chat_lda
## A LDA_VEM topic model with 3 topics.
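
To explore the Part II prompt, one option is to refit the model with a much larger k and compare the top terms per topic; the sketch below uses k = 30 purely for illustration. With only 33 documents, 30 topics will likely yield many sparse, hard-to-name topics, while 3 topics forces a few broad themes.

# A sketch: refit with a large number of topics for comparison
chat_lda_30 <- LDA(chat_dtm, k = 30, control = list(seed = 588))

# Top 5 terms per topic for each model, via tidytext's tidy()
tidy(chat_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup()

tidy(chat_lda_30, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 5) %>%
  ungroup()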
# Pull the processed documents, metadata, and vocabulary out of textProcessor
docs <- chattemp$documents
meta <- chattemp$meta
vocab <- chattemp$vocab

# Fit a structural topic model with K = 3 topics
chat_stm <- stm(documents = docs,
                data = meta,
                vocab = vocab,
                K = 3,
                max.em.its = 25,
                verbose = FALSE)

chat_stm
## A topic model with 3 topics, 33 documents and a 191 word dictionary.
# Summary plot: expected topic proportions with the top 5 words per topic
plot.STM(chat_stm, n = 5)
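
To go beyond the summary plot, stm's labelTopics() prints several word lists (highest probability, FREX, lift, and score) that can help with naming topics, and findThoughts() pulls representative documents. A brief sketch; passing meta$text assumes the text column carried through textProcessor() above.

# A sketch: inspect word lists and example responses per topic
labelTopics(chat_stm, n = 5)

# Show two representative chat responses for topic 1
findThoughts(chat_stm, texts = meta$text, topics = 1, n = 2)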

Knit & Submit

Congratulations, you’ve completed your Intro to Text Mining badge! Complete the following steps to submit your work for review:

  1. Change the author: field in the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.

  2. Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.

  3. Commit your changes in GitHub Desktop and push them to your online GitHub repository.

  4. Publish your HTML page to the web using one of the following publishing methods:

    • Publish on RPubs by clicking the “Publish” button located in the Viewer pane when you knit your document. Note that you will first need to create an RPubs account, which only takes a moment.

    • Publish on GitHub using either GitHub Pages or the HTML previewer.

  5. Post a new discussion on GitHub to our Text Mining Badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.