Text Mining Independent Project 2 Speckart

Research Question

I thought it would be interesting to see if online discussions about movies showed a persistent set of recurring topics that could be discovered through topic modeling, even as genres and subjects of discussion changed. To investigate this question, I downloaded discussions from the media sharing and discussion site Reddit (www.reddit.com), which hosts thousands of user-moderated discussion groups on almost any topic. It has a number of “subreddits” devoted to movies and movie-making, from discussing cinema in general to discussing cinematography.

Defining what a movie “topic” is in such a diverse discussion forum as Reddit is an open question, which brings both opportunity and difficulty when using topic modeling methods. No clear count of topics emerged from a literature review of similar attempts to use topic modeling in online movie discussions, so optimization algorithms were used to find ideal sets of topics. The topic counts in these algorithms were limited to 15 or fewer to aid in human interpretation of the results.

Data Description

I started from a list of movie-related subreddits: https://www.reddit.com/r/movies/comments/wxee2/the_big_list_of_movie_related_subreddits_2/. I selected subreddits with heavy comment activity, since many subreddits have few posts and few comments. For example, r/horror was selected because it is dominated by movie-related comments, as opposed to r/scifi, another heavily frequented subreddit, which covers books and other art forms heavily and is therefore not as focused on movies. The subreddits selected were r/movies, r/film, r/classicfilms, r/bollywood, r/documentaries, r/badmovies, and r/horror.

I used the https://camas.unddit.com/ Reddit search engine to gather 1000 comments from the three-month period of January 1 to April 1, 2022. Each subreddit’s comments were saved as JSON files and imported into R.

Because the LDA and STM modeling methods look for separate documents that they can split into individual “bags of words,” and because the posts on Reddit tend to be very short and fragmentary, the content of each subreddit was treated as a single document. The text was tokenized, and the imported text needed its curly apostrophes converted to straight apostrophes in order for stop words to be identified. Some custom stop words were used to remove common words that cluttered initial models.
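To illustrate why the apostrophe conversion matters, here is a minimal base R sketch. The short `stop_words_toy` vector is only a hypothetical stand-in for the `word` column of the tidytext `stop_words` table, and the tokens are made-up examples rather than data from this project.

```r
# Hypothetical stand-in for tidytext's stop_words$word (straight apostrophes)
stop_words_toy <- c("don't", "it's", "the")

# Made-up tokens as they might arrive from Reddit, with curly apostrophes (U+2019)
tokens <- c("don\u2019t", "it\u2019s", "scary")

# Before conversion, none of the contractions are recognized as stop words
matched_before <- tokens %in% stop_words_toy

# Replace the curly apostrophe with a straight one, as done for data$body
fixed <- gsub("\u2019", "'", tokens, fixed = TRUE)

# After conversion, the contractions match and only the content word survives
kept <- fixed[!fixed %in% stop_words_toy]
kept
```

Without the conversion, contractions such as “don’t” would pass through the stop word filter and clutter the models.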

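# Load packages (dplyr, tidyr, purrr, ggplot2, and view() come from the tidyverse)

library(jsonlite)
library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(ldatuning)
library(stm)
library(knitr)

# Read in data files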

reddit_movies_data <- fromJSON("movies.json", flatten = FALSE)
reddit_badmovies_data <- fromJSON("badmovies.json", flatten = FALSE)
reddit_bollywood_data <- fromJSON("bollywood.json", flatten = FALSE)
reddit_classicfilms_data <- fromJSON("classicfilms.json", flatten = FALSE)
reddit_documentaries_data <- fromJSON("documentaries.json", flatten = FALSE)
reddit_film_data <- fromJSON("film.json", flatten = FALSE)
reddit_horror_data <- fromJSON("horror.json", flatten = FALSE)

#pick out data list elements of json files as their own dataframe, select most useful columns, add new column listing document name

movies <- reddit_movies_data$data %>%
  dplyr::select('author', 'body', 'created_utc', 'is_submitter', 'parent_id', 'permalink', 'score') %>%
  mutate(source="movies")
badmovies <- reddit_badmovies_data$data %>%
  dplyr::select('author', 'body', 'created_utc', 'is_submitter', 'parent_id', 'permalink', 'score') %>%
  mutate(source="badmovies")
bollywood <- reddit_bollywood_data$data %>%
  dplyr::select('author', 'body', 'created_utc', 'is_submitter', 'parent_id','permalink', 'score') %>%
  mutate(source="bollywood")
classicfilms <- reddit_classicfilms_data$data %>%
  dplyr::select('author', 'body', 'created_utc', 'is_submitter', 'parent_id', 'permalink', 'score') %>%
  mutate(source="classicfilms")
documentaries <- reddit_documentaries_data$data %>%
  dplyr::select('author', 'body', 'created_utc', 'is_submitter', 'parent_id','permalink', 'score') %>%
  mutate(source="documentaries")
film <- reddit_film_data$data %>%
  dplyr::select('author', 'body', 'created_utc', 'is_submitter', 'parent_id','permalink', 'score') %>%
  mutate(source="film")
horror <- reddit_horror_data$data %>%
  dplyr::select('author', 'body', 'created_utc', 'is_submitter', 'parent_id','permalink', 'score') %>%
  mutate(source="horror")

# combine into one dataframe; the source column added above names each document

data <- movies %>%
  rbind(badmovies, bollywood, classicfilms, documentaries, film, horror)

# replace fancy apostrophes with regular apostrophes to allow processing via stop_words

data$body<-gsub("’", "'",data$body)

# Tokenize text

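# tokenize posts by word, remove stop words, filter out custom stop words
# note: unnest_tokens() strips punctuation, so "[deleted]" never matches here and the token "deleted" stays in the data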

text_tidy <- data %>%
  unnest_tokens(output = word, input = body) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c(0:9, "https", "http", "[deleted]", "removed", "movie", "movies", "post", "film", "films", "shit", "fuck", "it's", "cc", "www.reddit.com"))

LDA Modeling

To determine the number of topics for the LDA models, the FindTopicsNumber() function of the ldatuning package in R was used. Early exploration showed a potential set of optimal models in the 50+ topic count range, but that many topics is likely overfitted and is not appropriate for this study. The search was therefore restricted to between 2 and 15 topics in order to keep the results human-interpretable and to avoid overfitting.

# Build document matrix

text_dtm <- text_tidy %>%
  count(source, word) %>%
  cast_dtm(source, word, n)
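For readers unfamiliar with document-term matrices, the following base R sketch shows the shape of what the count() and cast_dtm() steps produce: one row per document (subreddit), one column per word, and counts in the cells. The tokens are made up for illustration, and the dense table() here is only a stand-in for the sparse DocumentTermMatrix that cast_dtm() returns.

```r
# Made-up tokens in the same one-row-per-word layout as text_tidy
tokens <- data.frame(
  source = c("horror", "horror", "horror", "film", "film"),
  word   = c("scary",  "house",  "scary",  "story", "scary")
)

# Cross-tabulate documents against words; each cell is a word count
dtm <- table(document = tokens$source, term = tokens$word)
dtm
```

Because each subreddit is treated as one document, the real matrix in this project has only seven rows but many thousands of columns.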

# Looking at stem words

stemmed_text <- data %>%
  unnest_tokens(output = word, input = body) %>%  
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% c(0:9, "https", "http", "[deleted]", "removed", "movie", "movies", "post", "film", "films", "shit", "fuck", "it's", "cc", "www.reddit.com")) %>%
  mutate(stem = wordStem(word)) 

# Convert stemmed text into document matrix

stemmed_text_dtm <- stemmed_text %>%
  count(source, stem) %>%
  cast_dtm(source, stem, n)

# Find stem counts

stem_counts <- stemmed_text %>%
  count(stem, sort = TRUE)

view(stem_counts)

k_metrics <- FindTopicsNumber(
  text_dtm,
  topics = seq(2, 15, by = 1 ),
  metrics = c("Griffiths2004","CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)

FindTopicsNumber_plot(k_metrics)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ldatuning package.
##   Please report the issue at <https://github.com/nikita-moor/ldatuning/issues>.

There seems to be a breakpoint at 5 or 6 topics in the FindTopicsNumber_plot() output, so I tried a few models around those numbers to select an optimum K.
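Reading a breakpoint off the plot is a judgment call. One rough way to supplement it is to rescale each metric to the 0 to 1 range and average them, which is sketched below. This is not what was done in this project. The sketch assumes the data frame returned by FindTopicsNumber() has a `topics` column plus one column per metric, and the numbers in `toy_metrics` are made-up placeholders, not results from this study. Griffiths2004 and Deveaud2014 are treated as metrics to maximize, and CaoJuan2009 and Arun2010 as metrics to minimize.

```r
# Placeholder values only; NOT the k_metrics produced in this project
toy_metrics <- data.frame(
  topics        = 2:6,
  Griffiths2004 = c(-500, -450, -400, -410, -430),  # higher is better
  CaoJuan2009   = c(0.30, 0.22, 0.15, 0.18, 0.25),  # lower is better
  Arun2010      = c(40, 30, 22, 24, 28),            # lower is better
  Deveaud2014   = c(1.0, 1.4, 1.9, 1.7, 1.5)        # higher is better
)

# Rescale a metric so its worst-to-best range maps onto 0..1
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

maximize <- c("Griffiths2004", "Deveaud2014")
minimize <- c("CaoJuan2009", "Arun2010")

# Flip the minimize metrics so that 1 always means "best", then average
score <- rowMeans(cbind(
  sapply(toy_metrics[maximize], rescale01),
  1 - sapply(toy_metrics[minimize], rescale01)
))

best_k <- toy_metrics$topics[which.max(score)]
best_k
```

A combined score like this hides disagreement between metrics, so it is best treated as a companion to the plot rather than a replacement for it.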

text_lda_5 <- LDA(text_dtm, 
                  k = 5, 
                  control = list(seed = 588)
)

# Let's try an LDA model with 6 topics, which was a break point in the FindTopicsNumber_plot results

text_lda_6 <- LDA(text_dtm, 
                k = 6, 
                control = list(seed = 588)
)


# Just in case, let's try LDA with 8 topics, which matches what we'll get for STM models

text_lda_8 <- LDA(text_dtm, 
                  k = 8, 
                  control = list(seed = 588)
)

There was an inflection point in the plot of the FindTopicsNumber() results between 5 and 6 topics. While the results of the 5-topic model were indistinct, the results of the 6-topic model were clearer and more useful, and are shown below.

terms(text_lda_6, 7)

##      Topic 1     Topic 2         Topic 3       Topic 4   Topic 5   Topic 6     
## [1,] "time"      "documentary"   "bollywood"   "bad"     "horror"  "required"  
## [2,] "love"      "2020"          "promotional" "people"  "watch"   "bot"       
## [3,] "watch"     "murder"        "rules"       "time"    "account" "performed" 
## [4,] "story"     "documentaries" "people"      "watch"   "love"    "people"    
## [5,] "classic"   "length"        "link"        "deleted" "people"  "community" 
## [6,] "amp"       "title"         "posts"       "love"    "dream"   "moderators"
## [7,] "character" "trailer"       "hindi"       "pretty"  "time"    "source"

tidy_lda <- tidy(text_lda_6)

top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 7, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 7 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

I interpret these topics as the following:

Topic 1: related to why people like to watch movies in general

Topic 2: related to documentaries

Topic 3: related to Bollywood movies

Topic 4: related to bad movies, and why people stop watching movies

Topic 5: related to horror movies

Topic 6: related to subreddit moderation and post management

But how descriptive are they of the data? Let’s look at the beta and gamma measures of this model.

Looking at Beta, Gamma, and other measures

# looking at beta and gamma

td_beta <- tidy(text_lda_6)

td_gamma <- tidy(text_lda_6, matrix = "gamma")

td_beta

## # A tibble: 97,398 × 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 _1    3.61e-116
##  2     2 _1    1.02e-103
##  3     3 _1    3.82e-113
##  4     4 _1    5.51e-  5
##  5     5 _1    2.68e-112
##  6     6 _1    6.28e-107
##  7     1 _al   5.82e-116
##  8     2 _al   1.20e-103
##  9     3 _al   1.63e-112
## 10     4 _al   5.51e-  5
## # … with 97,388 more rows

td_gamma

## # A tibble: 42 × 3
##    document      topic      gamma
##    <chr>         <int>      <dbl>
##  1 badmovies         1 0.00000288
##  2 bollywood         1 0.00000172
##  3 classicfilms      1 1.00      
##  4 documentaries     1 0.00000149
##  5 film              1 0.134     
##  6 horror            1 0.00000267
##  7 movies            1 0.00000241
##  8 badmovies         2 0.00000288
##  9 bollywood         2 0.00000172
## 10 classicfilms      2 0.00000132
## # … with 32 more rows

The beta values shown are extremely small, but these first rows are rare tokens (such as “_1” and “_al”) that happen to come first alphabetically, so they contribute almost nothing to the definition of most topics. In contrast, some of the gamma values are relatively large. In particular, the topic 1 gamma values for the r/classicfilms and r/film subreddits are about 1.00 and 0.134 respectively, which is roughly 5 orders of magnitude larger than for the other subreddits. Since gamma is the estimated proportion of a document that belongs to a topic, this implies that topic 1 fits these two subreddits in particular, and r/classicfilms almost entirely.
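The size of that gap can be checked with a little arithmetic on the topic 1 rows of the td_gamma output printed above. This is a small sketch that re-types the printed (rounded) values by hand rather than reading them from the model object.

```r
# Topic 1 gamma values copied from the printed td_gamma output (rounded as shown)
gamma_topic1 <- c(
  badmovies     = 0.00000288,
  bollywood     = 0.00000172,
  classicfilms  = 1.00,
  documentaries = 0.00000149,
  film          = 0.134,
  horror        = 0.00000267,
  movies        = 0.00000241
)

high <- gamma_topic1[c("classicfilms", "film")]
low  <- gamma_topic1[setdiff(names(gamma_topic1), names(high))]

# Orders of magnitude between every high/low pair of subreddits
orders <- log10(outer(high, low, "/"))
round(range(orders), 2)
```

Every pairwise gap falls between four and six orders of magnitude, which is consistent with describing the difference as roughly five.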

Now let’s examine the expected top 20 words for each topic.

top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(20, beta) %>%
  arrange(-beta) %>%
  #  select('topic', 'term') %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = c(terms))

gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

gamma_terms %>%
  #  select(topic, gamma, terms) %>%
  kable(digits = 3, 
        col.names = c("Topic", "Expected topic proportion", "Top 20 terms"))

Topic	Expected topic proportion	Top 20 terms
Topic 4	0.294	bad, people, time, watch, deleted, love, pretty, fun, lot, scene, yeah, gt, guy, hard, character, watched, lol, remember, watching, actor
Topic 3	0.235	bollywood, promotional, rules, people, link, posts, hindi, english, posting, watch, meme, action, youtube, time, political, subreddit, message, read, language, actors
Topic 5	0.166	horror, watch, account, love, people, dream, time, dead, deleted, house, favorite, black, lot, karma, nightmare, watched, game, scene, pretty, final
Topic 1	0.162	time, love, watch, story, classic, amp, character, people, lot, watched, watching, life, favorite, 10, fun, loved, hollywood, bit, deleted, star
Topic 2	0.130	documentary, 2020, murder, documentaries, length, title, trailer, 4chan, people, based, narrative, story, bianca, time, description, 59, submission, 00, online, subreddit
Topic 6	0.013	required, bot, performed, people, community, moderators, source, contact, 01, xx:xx:xx, automatically, time, compose, feel, concerns, subreddit, comment, money, crime, documentary

There are no big surprises here, but the results are consistent with what was seen above.
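The “Expected topic proportion” column is the mean of gamma across the seven subreddit documents, as computed in gamma_terms. As a quick sanity check, the sketch below recomputes it for topic 1 from the rounded td_gamma values printed earlier. The values are re-typed by hand, so this is only an approximate check.

```r
# Topic 1 gamma values copied from the printed td_gamma output (rounded as shown)
gamma_topic1 <- c(
  badmovies     = 0.00000288,
  bollywood     = 0.00000172,
  classicfilms  = 1.00,
  documentaries = 0.00000149,
  film          = 0.134,
  horror        = 0.00000267,
  movies        = 0.00000241
)

# Mean gamma over the seven documents, rounded like the kable() output
expected_topic1 <- round(mean(gamma_topic1), 3)
expected_topic1
```

The result agrees with the 0.162 reported for Topic 1 in the table, which shows that its share comes almost entirely from r/classicfilms and r/film.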

STM Modeling

To look at structural topic modeling (STM) of these documents, which brings document metadata from the dataset’s columns into the modeling algorithm, the searchK() function in the stm package in R was used to evaluate models with between 5 and 15 topics. Based on the resulting plot, a count of 8 topics was chosen, which is close to the 6 topics used in the LDA model.

temp <- textProcessor(data$body, 
                      metadata = data,  
                      lowercase=TRUE, 
                      removestopwords=TRUE, 
                      removenumbers=TRUE,  
                      removepunctuation=TRUE, 
                      wordLengths=c(3,Inf),
                      stem=TRUE,
                      onlycharacter= FALSE, 
                      striphtml=TRUE, 
                      customstopwords=NULL)

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...

meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents

# Let's try to find K mathematically for our STM models

findingk <- searchK(docs, 
                    vocab, 
                    K = c(5:15),
                    data = meta, 
                    verbose=FALSE)

plot(findingk)

Let’s go with 8, close to the LDA model, because there’s an inflection point there in the plot.

text_stm <- stm(documents=docs, 
                  data=meta,
                  vocab=vocab,
                  prevalence =~ permalink + source,
                  K=8,
                  max.em.its=25,
                  gamma.prior='L1',
                  verbose = FALSE)


# Plot the STM model to see what the stems are like

plot.STM(text_stm, n = 7)

# Let's look at it in an LDAvis chart

toLDAvis(mod = text_stm, docs = docs)

## Loading required namespace: servr

The STM method using word stems produced topics that were broadly similar to the LDA model, but were harder to interpret. The STM model topics could be described as follows:

Topic 1: perhaps related to conversations in movies

Topic 2: perhaps related to documentaries

Topic 3*: The dominant topic, perhaps about suggesting movies to others

Topic 4: related to musicals

Topic 5: perhaps related to positive movie experiences

Topic 6: related to documentaries

Topic 7: perhaps related to comparing old and new movies, perhaps largely horror movies

Topic 8: perhaps related to unbelievable elements in movies

These eight topics have occasional words that identify distinguishable topics, such as “song” in topic 4 and “documentari” in topic 6, and these topics do depict a broader range of discussions than the LDA model.

Conclusion and Discussion of Limitations

The LDA method yielded more interpretable results with 6 topics, as opposed to the STM method’s 8 topics. The amount of overlap between word distributions in topics for both methods led to results that were less distinct than desired. However, the STM model produced more duplication between topics than the LDA model, making its additional 2 topics fairly redundant. I believe this is because the metadata included terms that were either already in the corpus (such as “movie” or “bollywood”), or did not add information that was tied to any particular theme.
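The overlap between topics can be made concrete for the LDA model with a simple Jaccard index (shared terms divided by all distinct terms) on the top 7 terms per topic. The sketch below re-types the terms from the terms(text_lda_6, 7) output shown earlier. Jaccard overlap on top terms is my own illustrative choice here, not a measure that was used in the analysis.

```r
# Top 7 terms per topic, copied from the terms(text_lda_6, 7) output
top_terms_by_topic <- list(
  topic1 = c("time", "love", "watch", "story", "classic", "amp", "character"),
  topic2 = c("documentary", "2020", "murder", "documentaries", "length", "title", "trailer"),
  topic3 = c("bollywood", "promotional", "rules", "people", "link", "posts", "hindi"),
  topic4 = c("bad", "people", "time", "watch", "deleted", "love", "pretty"),
  topic5 = c("horror", "watch", "account", "love", "people", "dream", "time"),
  topic6 = c("required", "bot", "performed", "people", "community", "moderators", "source")
)

# Jaccard index: size of the intersection over size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise overlap between all topics as a 6 x 6 matrix
overlap <- sapply(top_terms_by_topic, function(a) {
  sapply(top_terms_by_topic, function(b) jaccard(a, b))
})
round(overlap, 2)
```

By this measure the documentary topic shares none of its top terms with the others, while the bad-movie and horror topics share four of their seven top terms (“people”, “time”, “watch”, “love”), which matches the impression that the general-audience topics blur together.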

In the end, it is interesting that the LDA model produced 6 topics from the original 7 subreddits, with one of those topics being related to forum maintenance issues. Thus only 5 topics were about the content of movies. This suggests that the structure of online movie discussions is largely repetitive and similar across interest groups, and that the incidental mentions of a specific character or actor name do not significantly change the topics of discussion.

There are potential limitations to this analysis due to the nature of discussions on Reddit. The comments on each subreddit tend to be short and informally phrased, and can often be just a single word or phrase, rather than a long and coherent series of sentences. Longer posts do appear regularly, but the majority of posts are much shorter. In theory, the “bag of words” approach used in LDA and STM models ignores the length of the original posts, and this study pooled together the comments of each subreddit into individual “bags” to create documents to study.
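A minimal base R sketch of the pooling idea follows. The first two comments echo the kind of short replies discussed in this section and the third is made up. The `bag()` helper is hypothetical and far simpler than the tokenizers used in the analysis.

```r
# Three short comments pooled into one "document"
comments <- c("yeah i agree", "i hated that", "great scary house scene")

# Hypothetical helper: lowercase, split on whitespace, count each word
bag <- function(txt) table(unlist(strsplit(tolower(txt), "\\s+")))

# Pool the comments in their original order and in reverse order
pooled   <- bag(paste(comments, collapse = " "))
shuffled <- bag(paste(rev(comments), collapse = " "))

# The word counts are the same, so comment order and boundaries are lost
identical(pooled, shuffled)
```

Once pooled, the model sees only word counts per subreddit, so it cannot tell whether a word came from one long post or from many one-line replies.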

However, perhaps a limiting factor to this topic modeling approach is the content of Reddit itself. The writing style of many short posts on Reddit leaves out the subjects and objects of sentences, such as comments of a simple “yeah, I agree” or “I hated that.” These comments provide little information to a topic model. Perhaps the less clear results from the STM model reflect these vague comments: as metadata is added, the model accurately sees that there are a large number of short, disconnected conversations, and therefore it will tend to produce a larger number of topics that may reflect the variety of Reddit discussions, but that are not helpful in interpreting them.