Import

Load libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(tidytext)
library(SnowballC)
library(knitr)
library(stm)

## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com

library(topicmodels)
library(LDAvis)
library(ldatuning)

Import data set (using the same data as the Unit 3 walkthrough)

ts_forum_data <- read_csv("data/ts_forum_data.csv", 
     col_types = cols(course_id = col_character(),
                   forum_id = col_character(), 
                   discussion_id = col_character(), 
                   post_id = col_character()))

Preprocess

Pull bigrams, remove stop words, and stem words

forum_bigrams <- ts_forum_data %>%   
  unnest_tokens(output = bigram, input = post_content, token = "ngrams", n = 2)

forum_bigrams <- forum_bigrams %>% 
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  mutate(word1 = wordStem(word1)) %>% 
  mutate(word2 = wordStem(word2)) %>% 
  unite(bigram, c(word1, word2), sep = " ")
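
To see what these steps do, the same pipeline can be run on a single made-up post. This is only a sketch: the `toy_posts` tibble and its text are invented for illustration and are not part of the forum data.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(SnowballC)

# Made-up post, for illustration only
toy_posts <- tibble(post_id = "1",
                    post_content = "Students are analyzing messy data sets")

toy_bigrams <- toy_posts %>%
  # Split the text into overlapping two-word sequences
  unnest_tokens(output = bigram, input = post_content, token = "ngrams", n = 2) %>%
  # Put each word in its own column so it can be checked against the stop word list
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # Stem each word, then glue the pair back together
  mutate(word1 = wordStem(word1),
         word2 = wordStem(word2)) %>%
  unite(bigram, c(word1, word2), sep = " ")

toy_bigrams
```

Any bigram containing a stop word such as "are" is dropped entirely, and the remaining words are reduced to their stems.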

Analyze

forum_bigrams_plot <- forum_bigrams %>%
  group_by(discussion_name) %>%
  count(bigram, sort = TRUE) %>%
  top_n(20) %>%
  ungroup()

## Selecting by n

Separating out the four discussion forums with the most common bigrams. I did this because my file kept crashing when I tried to run the full data set.

top_four_repeat <- forum_bigrams_plot %>%
  # %in% keeps rows matching any of the four forum names; == would recycle the vector
  filter(discussion_name %in% c("Animation videos",
                                "Large, messy data vs. small, neat examples",
                                "Middle School Students",
                                "Teaching Statistics Through Data Investigations Starts Today!")) %>%
  filter(n > 1)
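
One thing to watch when filtering on several forum names: `==` compares position by position and recycles the shorter vector, so it can silently drop rows that should match, while `%in%` tests membership. A minimal sketch with made-up data (the `forums` tibble and its values are assumptions for illustration, not taken from the forum data set):

```r
library(dplyr)

# Toy data, invented for illustration only
forums <- tibble(
  discussion_name = c("A", "B", "A", "B", "C", "A"),
  n = c(5, 4, 3, 2, 6, 1)
)
wanted <- c("A", "B")

# == recycles `wanted` as A, B, A, B, A, B and compares row by row,
# so the last "A" row is lost because it lines up with "B"
by_equals <- forums %>% filter(discussion_name == wanted)

# %in% keeps every row whose name is any of the wanted values
by_in <- forums %>% filter(discussion_name %in% wanted)

nrow(by_equals)
nrow(by_in)
```

Here the two filters return different numbers of rows even though no warning is raised, because the column length happens to be a multiple of the length of `wanted`.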

Graphing the bigrams in the discussion forums that I separated out. I removed any bigrams that appeared only once in a discussion forum to save space and reduce the computational load.

top_four_repeat %>%
  # reorder_within() sorts bigrams by count inside each facet; scale_x_reordered() cleans up the labels
  mutate(bigram = reorder_within(bigram, n, discussion_name)) %>%
  ggplot(aes(x = bigram, y = n, fill = bigram)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ discussion_name, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count",
       x = "Unique Bigrams",
       title = "Most frequent bigrams found in discussion forums",
       subtitle = "Stop words and bigrams with only one instance were removed")

Visualize

Loading libraries I forgot at the beginning of the analysis.

library(udpipe)
library(BTM)

Parts of speech tagging and building the biterm topic model.

I tried to do this, but I’m not sure exactly how to make it work. I get an error message when I run it, and I don’t know how to fix it. This is the error message: Error in udpipe.data.frame(ts_forum_data, "english", trace = 10) : all(c("doc_id", "text") %in% colnames(x)) is not TRUE

#anno    <- udpipe(ts_forum_data, "english", trace = 10)
#biterms <- as.data.table(anno)
#biterms <- biterms[cooccurrence(x = post_content,
#                                  relevant = upos %in% c("NOUN",
#                                                         "ADJ",
#                                                         "PROPN"),
#                                  skipgram = 5),
#                   by = list(discussion_name)]


#set.seed(123)
#traindata <- subset(anno, upos %in% c("NOUN", "ADJ", "PROPN"))
#traindata <- traindata[, c("discussion_name", "post_content")]
#model <- BTM(traindata, k = 20, 
#             beta = 0.01, 
#             iter = 500,
#             biterms = biterms, 
#             trace = 100)

I started to look at the information on the different pieces of the code above and realized that (I think) my data isn’t in the right format to be compatible with the code. To be honest, I’m still not sure what format it should be in or what is wrong with my code, and I don’t have the skills to figure it out.
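
The error message points to one likely cause: when `udpipe()` is given a data frame, it checks for columns named `doc_id` and `text`. A minimal sketch of that reshaping, assuming `post_id` identifies each post and `post_content` holds its text (the toy `posts` tibble is made up so the example runs without the course data):

```r
library(dplyr)

# Made-up stand-in for ts_forum_data, for illustration only
posts <- tibble(
  post_id = c("101", "102"),
  discussion_name = c("Animation videos", "Middle School Students"),
  post_content = c("The animation videos were helpful.",
                   "My middle school students enjoy real data.")
)

# Rename the id and text columns to the names udpipe() checks for
posts_for_udpipe <- posts %>%
  select(doc_id = post_id, text = post_content)

# The same condition that failed in the error message
all(c("doc_id", "text") %in% colnames(posts_for_udpipe))
```

With the real data, a data frame shaped like this should get past that check in `udpipe()`. The annotated output has its own columns (such as `doc_id`, `lemma`, and `upos`) rather than `post_content` or `discussion_name`, so the later `cooccurrence()` and `BTM()` steps would likely need to refer to those columns instead. That part is untested here.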

Communicate

Research Question: What themes/ideas can bigrams illuminate in the discussion forum data?

Exploring the data showed that bigrams can help identify what each discussion was about. This could have been more effective if I had kept all of the bigrams in my final visualization, but I removed the one-time occurrences to make the plot more legible and to reduce the computational burden. Some of the common bigrams were not very helpful, so in future iterations I would comb through my data more carefully to remove stop words that are specific to my data set.
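
Removing stop words that are specific to this data set could be done by extending the stop word list before filtering. A minimal sketch: the words in `custom_stop_words` and the toy word pairs are placeholders made up for illustration, not words confirmed to be uninformative in the forum data.

```r
library(dplyr)
library(tidytext)

# Hypothetical data-set-specific stop words (placeholders)
custom_stop_words <- tibble(word = c("https", "unit"), lexicon = "custom")
all_stop_words <- bind_rows(stop_words, custom_stop_words)

# Toy word pairs, invented for illustration
toy_pairs <- tibble(
  word1 = c("messy", "https", "unit"),
  word2 = c("dataset", "www", "plan")
)

# Drop a pair if either word is in the extended stop word list
filtered_pairs <- toy_pairs %>%
  filter(!word1 %in% all_stop_words$word,
         !word2 %in% all_stop_words$word)

filtered_pairs
```

In the preprocessing step, `all_stop_words$word` would take the place of `stop_words$word` in the two `filter()` calls.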

My biggest struggle with this analysis was the biterm topic model (BTM): I couldn’t figure out the code. I did manage to get my data into bigrams and make a graph, but I didn’t understand how to actually build the BTM when I got there. If I do topic modeling for my final project, I think I’ll stick with something closer to what we did in this unit’s walkthrough.

ECI 588 Unit 3 Independent Analysis

Grace Wiedrich

2023-03-26
