A Text Mining Analysis of Movie Reviews

1. Executive Summary

Main Question: How do movie genres—specifically Horror and Romance—influence the emotional language used in audience reviews?

Overview: This project explores how language and emotion differ across movie genres by analyzing IMDb reviews for Horror and Romance films using text mining techniques in R. Through a series of visual and statistical analyses—including term frequency, TF-IDF, bigram networks, log odds ratio, word co-occurrence, phi coefficient networks, and sentiment distribution (NRC)—the study reveals distinct lexical and emotional profiles in each genre.

Key Insights: The visualizations show that horror reviews tend to cluster around themes of fear and violence, whereas romance reviews emphasize affection and emotional depth. This suggests that film genres not only influence narrative content but also shape the emotional vocabulary used in audience responses.

Final Graphics Include:

1. TF-IDF Analysis

2. Bigram Network

3. Log Odds Ratio

4. Co-occurrence Networks

5. Sentiment Analysis (NRC)

2. Data Background

The data used in this analysis is sourced from the IMDb Movie Reviews Dataset, containing 50,000 labeled reviews with corresponding full-text content.

Because genre labels were not included, I assigned genres with a keyword-based heuristic: reviews containing words such as “horror,” “ghost,” or “scary” were labeled as Horror, and those containing “romance,” “love,” or “relationship” were labeled as Romance. This approach introduces some noise (for example, a single review can match both keyword lists), but it enables a genre-level comparison grounded in lexical patterns.

To maintain balance, 500 horror and 500 romance reviews were randomly sampled for the analysis.

3. Data Preprocessing

3.1 Data Loading, Cleaning and Preprocessing

To prepare the dataset for text analysis, several preprocessing steps were carried out to ensure the quality and relevance of the data.

Overview of Preprocessing Steps

·Converted all text to lowercase

·Tokenized text into individual words using tidytext::unnest_tokens()

·Removed standard and domain-specific stop words (e.g., movie, film, character)

·Removed numbers, punctuation, and single-character tokens

·Mapped words to associated emotions using the NRC lexicon

These steps were essential for subsequent tasks such as term frequency analysis, sentiment classification, and bigram network construction.

Step 1: Load Data and Required Libraries

We begin by loading the dataset and importing essential packages for text mining, including tidytext, dplyr, ggplot2, and stringr. Later sections also rely on tidyr, igraph, ggraph, widyr, tidylo, and tidygraph.
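A minimal sketch of this loading step, under stated assumptions: the package list is collected from the functions called later in this report, `load_reviews()` is a hypothetical helper, and the CSV path is a placeholder rather than the file actually used. The only column the later code relies on is `review`.

```r
# Packages used across the report (collected from the calls in later sections).
required_pkgs <- c("dplyr", "tidyr", "stringr", "ggplot2", "tidytext",
                   "widyr", "tidylo", "igraph", "ggraph", "tidygraph",
                   "textdata")  # textdata supplies the NRC lexicon download

# Report anything that still needs install.packages() before attaching.
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(required_pkgs[!installed], collapse = ", "))
}
for (pkg in required_pkgs[installed]) {
  library(pkg, character.only = TRUE)
}

# Hypothetical helper: read the reviews and check the one column used later.
load_reviews <- function(path) {
  reviews <- read.csv(path, stringsAsFactors = FALSE)
  stopifnot("review" %in% names(reviews))
  reviews
}

# df <- load_reviews("path/to/imdb_reviews.csv")  # placeholder path
```

Checking for the `review` column up front makes a wrong file fail immediately instead of partway through the pipeline.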

Step 2: Genre Classification by Keyword Filtering

Since the original dataset does not contain genre labels, we constructed genre categories (Horror and Romance) by keyword matching. This approach assumes that the presence of genre-relevant terms in the text is a reasonable proxy for classification.

# Define keyword lists for filter
horror_keywords <- c("horror", "scary", "blood", "ghost", "kill", "terror", "monster", "nightmare", "zombie", "dark")
romance_keywords <- c("love", "romantic", "relationship", "heart", "kiss", "passion", "marriage", "romance", "affair", "emotion")

df <- df %>%
  mutate(review_lower = tolower(review))

df_horror <- df %>%
  filter(str_detect(review_lower, str_c(tolower(horror_keywords), collapse = "|"))) %>%
  mutate(genre = "Horror")

df_romance <- df %>%
  filter(str_detect(review_lower, str_c(tolower(romance_keywords), collapse = "|"))) %>%
  mutate(genre = "Romance")

# Sample 500 reviews from each genre for balance
df_sample <- bind_rows(
  sample_n(df_horror, 500),
  sample_n(df_romance, 500)
) %>%
  mutate(doc_id = row_number())
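Because a plain alternation pattern matches substrings, a keyword such as "kill" also fires on "skill", and nothing prevents one review from matching both keyword lists. The sketch below uses base R on made-up reviews with shortened keyword lists to compare substring matching with whole-word matching and to flag reviews that fall into both genres. The helper `keyword_pattern()` is an illustration, not part of the original analysis.

```r
# Shortened keyword lists and made-up reviews, for illustration only.
horror_keywords  <- c("horror", "scary", "kill", "dark")
romance_keywords <- c("love", "heart", "romance")
reviews <- c("a film about skill and gloves",
             "a dark love story",
             "the killer was scary")

# Hypothetical helper: wrap the alternation in word boundaries.
keyword_pattern <- function(keywords) {
  paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")
}

# Substring matching (the behavior of a bare "a|b|c" pattern).
horror_loose  <- grepl(paste(horror_keywords, collapse = "|"), reviews)
romance_loose <- grepl(paste(romance_keywords, collapse = "|"), reviews)

# Whole-word matching.
horror_strict  <- grepl(keyword_pattern(horror_keywords), reviews, perl = TRUE)
romance_strict <- grepl(keyword_pattern(romance_keywords), reviews, perl = TRUE)

# Reviews that would be assigned to both genres under whole-word matching.
in_both <- horror_strict & romance_strict

# A seed before sampling makes the 500-per-genre draw repeatable.
set.seed(2025)
sampled_ids <- sample(seq_along(reviews), size = 2)
```

Whole-word matching is stricter in both directions: it drops false hits such as "skill" and "gloves", but it also misses inflected forms such as "killer" or "loved" unless they are added to the list.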

Step 3: Text Cleaning and Tokenization

We removed HTML tags, special characters, and punctuation, then tokenized the text into words using unnest_tokens(). In addition to standard English stop words, we removed high-frequency but uninformative words specific to movie reviews (e.g., story, movie, watch).

# Define custom stop words relevant to movie review context
data("stop_words")
all_stop_words <- stop_words %>%
  bind_rows(tibble(word = c("movie", "film", "time","films", "story", "don", "character", 
                  "characters", "plot", "acting", "movies", "one", "get", 
                  "like", "just", "really", "make", "also", "good", "bad","people","scene", "watch"), 
         lexicon = "custom")
)

# Clean and tokenize text
tokens <- df_sample %>%
  mutate(review = str_replace_all(review, "<.*?>", " "), 
         review = str_to_lower(review),
         review = str_replace_all(review, "[^a-z\\s]", " ")) %>%
  unnest_tokens(word, review) %>%
  anti_join(all_stop_words, by = "word") %>%   # drops both standard and custom stop words
  select(doc_id, genre, word)

tokens %>% count(genre)

## # A tibble: 2 × 2
##   genre       n
##   <chr>   <int>
## 1 Horror  56687
## 2 Romance 47852
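A small base R illustration of what the cleaning regular expressions do to a single sentence. It shows why fragments such as "ve" and "don" can appear as tokens: replacing every non-letter with a space splits contractions at the apostrophe. The sentence is made up, and `strsplit()` stands in for `unnest_tokens()` here.

```r
raw <- "I've seen it, don't watch!<br />"

# Same three cleaning steps as above, written with base R functions.
no_html      <- gsub("<.*?>", " ", raw, perl = TRUE)        # strip HTML tags
lowered      <- tolower(no_html)                            # lowercase
letters_only <- gsub("[^a-z\\s]", " ", lowered, perl = TRUE) # non-letters -> space

# Split on whitespace as a stand-in for word tokenization.
word_tokens <- strsplit(trimws(letters_only), "\\s+", perl = TRUE)[[1]]
print(word_tokens)
```

The leftover pieces ("ve", "don", "t") are only removed later if they happen to be in the stop word list, which is why "don" is listed among the custom stop words.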

4. Text Data Analysis

In this section, I explore how emotional language and word usage vary across horror and romance movie reviews. Through a combination of frequency analysis, TF-IDF weighting, and bigram networks, I aim to uncover distinctive linguistic patterns tied to each genre.

5. Individual Analysis and Figures

5.1 Analysis and Figure 1: Top 20 Most Frequent Words by Genre

To gain an initial understanding of vocabulary patterns in each genre, I calculated raw word frequencies and visualized the top 20 most frequent words in Horror and Romance reviews using bar plots. These results reveal dominant word choices that characterize each genre, offering a baseline for later comparative analysis.

top_words <- tokens %>%
  count(genre, word, sort = TRUE) %>%
  group_by(genre) %>%
  slice_max(n, n = 20) %>%
  ungroup()

top_words %>%
  mutate(word = reorder_within(word, n, genre)) %>%  
  ggplot(aes(x = word, y = n, fill = genre)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~ genre, scales = "free") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words in Horror vs. Romance Reviews",
       x = "word", y = "frequency") +
  theme_minimal()

This plot illustrates the top 20 most frequently used words in IMDb reviews of horror and romance films, highlighting distinct patterns in language usage between the two genres.

In horror reviews, words such as “horror,” “scenes,” “death,” “dark,” and “world” are prominent. These terms reflect the genre’s thematic focus on fear, mortality, and atmosphere. The frequent use of “real” and “makes” may suggest how viewers perceive the film’s believability or emotional impact.

On the other hand, romance reviews are dominated by emotionally resonant words like “love,” “life,” “real,” “girl,” and “fun.” These indicate a focus on relationships, authenticity, and enjoyment. Additionally, terms like “music,” “cast,” and “book” suggest a broader engagement with character portrayal and narrative source material.

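Interestingly, some overlap exists between the two genres with words like “ve,” “scenes,” “real,” and “watching,” which likely represent general review language or common viewer experiences regardless of genre. The token “ve” is a fragment of contractions such as “I’ve,” left behind when apostrophes are replaced with spaces during cleaning.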

Overall, the chart reveals that while horror reviews are shaped by themes of fear and suspense, romance reviews emphasize emotional depth, personal connection, and character-driven storytelling.

5.2 Analysis and Figure 2: Top Genre-Specific Words by TF-IDF

While frequency reveals common words, it does not indicate which words are particularly distinctive to a genre. To address this, I computed the term frequency-inverse document frequency (TF-IDF) scores for each word within each genre. This method down-weights overly common terms and emphasizes those that are uniquely frequent in one category.
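When each genre is treated as one "document", there are only two documents, so the inverse document frequency can take just two values. The base R sketch below works through this on made-up counts, assuming the usual definition idf = ln(number of documents / number of documents containing the term), which is how `bind_tf_idf()` computes it as far as I know. All counts are hypothetical.

```r
# Made-up word counts per genre (each genre acts as one document).
counts <- data.frame(
  genre = c("Horror", "Horror", "Horror", "Romance", "Romance", "Romance"),
  word  = c("werewolf", "love", "dark", "sinatra", "love", "dark"),
  n     = c(5, 20, 30, 4, 60, 10)
)

n_docs   <- length(unique(counts$genre))   # 2 genres
doc_freq <- table(counts$word)             # number of genres containing each word

counts$tf     <- counts$n / ave(counts$n, counts$genre, FUN = sum)
counts$idf    <- log(n_docs / as.numeric(doc_freq[counts$word]))
counts$tf_idf <- counts$tf * counts$idf

print(counts)
```

Any word that occurs in both genres has idf = ln(2/2) = 0 and therefore a TF-IDF of zero, however frequent it is. Only words exclusive to one genre can score above zero, which is consistent with rare names and titles dominating a two-document TF-IDF ranking.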

# Count total word occurrences by genre
word_counts <- tokens %>%
  count(genre, word, sort = TRUE)

# Compute TF-IDF scores
tf_idf_words <- word_counts %>%
  bind_tf_idf(word, genre, n) %>%
  arrange(desc(tf_idf))

# Select top 20 TF-IDF words per genre
top_tf_idf <- tf_idf_words %>%
  group_by(genre) %>%
  slice_max(tf_idf, n = 20) %>%
  ungroup()

# Visualize top TF-IDF words
top_tf_idf %>% 
  mutate(word = reorder_within(word, tf_idf, genre)) %>%
  ggplot(aes(x = word, y = tf_idf, fill = genre)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~genre, scales = "free") +
  coord_flip() +
  labs(title = "Top Representative Words by TF-IDF in Each Movie Genre",
       x = "Word", y = "TF-IDF") +
  theme_minimal()

This visualization presents the top representative words by TF-IDF (term frequency–inverse document frequency) in horror and romance movie reviews, highlighting terms that are uniquely significant to each genre.

In horror reviews, distinctive words include “werewolf,” “infected,” “killjoy,” “zombi,” “rip,” and “cia.” These terms are closely tied to themes of death, supernatural elements, and violence, which are central to the horror genre. Names like “freddy,” “felix,” and “jericho” may refer to iconic horror characters or settings, reinforcing genre-specific references.

In contrast, romance reviews feature representative words such as “gadget,” “jefferson,” “sinatra,” “sally,” “cagney,” “grief,” and “ashes.” Many of these appear to be character or actor names, suggesting a focus on individual performances or relationships central to the narrative. Words like “grief” and “ashes” reflect the emotional depth and occasional melancholy often explored in romance stories.

Rather than simply emotional or terrifying vocabulary, this TF-IDF-based analysis reveals genre-specific entities, character names, and cultural references that are especially representative within each type of review. This method provides a more nuanced view of genre distinction, focusing not just on sentiment but on unique narrative and thematic elements.

5.3 Analysis and Figure 3: Bigram Network Analysis

To further understand contextual word relationships in each genre, I conducted a bigram network analysis, which visualizes frequently co-occurring two-word phrases. This method goes beyond single-word frequency to uncover common narrative patterns and expressions used in horror and romance reviews.

I first preprocessed the review texts by removing HTML tags, converting all text to lowercase, filtering out non-alphabetic characters, and excluding stop words. I then extracted bigrams and retained only those that appeared more than five times to focus on meaningful patterns.

bigrams_by_genre <- df_sample %>%
  mutate(review = str_replace_all(review, "<.*?>", " "),
         review = str_to_lower(review),
         review = str_replace_all(review, "[^a-z\\s]", " ")) %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(genre, word1, word2, sort = TRUE) %>%
  filter(n > 5)

I then split the bigrams by genre and constructed networks where nodes represent individual words and edges represent the frequency of bigram co-occurrence.

horror_bigrams <- bigrams_by_genre %>% filter(genre == "Horror")
romance_bigrams <- bigrams_by_genre %>% filter(genre == "Romance")

horror_graph <- graph_from_data_frame(horror_bigrams[, c("word1", "word2", "n")])
romance_graph <- graph_from_data_frame(romance_bigrams[, c("word1", "word2", "n")])

The resulting visualizations clearly illustrate the different narrative focuses of each genre.

Horror Movie Bigram Network

set.seed(2025)
ggraph(horror_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "blue") +
  geom_node_point(size = 4, color = "lightblue") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() +
  labs(title = "Horror Movie Review Bigram Network")

Romance Movie Bigram Network

set.seed(2025)
ggraph(romance_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "red") +
  geom_node_point(size = 4, color = "pink") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() +
  labs(title = "Romance Movie Review Bigram Network")

The Horror movie bigram network reveals commonly paired terms that evoke intense imagery and suspenseful atmosphere. Frequent combinations like “movie horror,” “entire candy,” “story dripping,” and “low budget” suggest recurring themes of fear, production critique, and grotesque or violent aesthetics. Terms like “zombie films,” “serial killer,” and “bloodbath scenes” emphasize the genre’s reliance on horror tropes and sensational content.

In contrast, the Romance movie bigram network highlights emotionally charged and character-driven word pairings. Phrases such as “love story,” “supporting affair,” “loved one,” and “romantic real” underscore common motifs of affection, relationship dynamics, and emotional journeys. The presence of names (e.g., “jane austen”) and settings (e.g., “york city”) further reflects the genre’s tendency to anchor narratives in personal and often idealized contexts.

5.4 Analysis and Figure 4: Log Odds Ratio of Genre-Distinctive Words

To further uncover words that are statistically distinctive between genres, I used the log odds ratio with an informative Dirichlet prior. Unlike simple frequency or TF-IDF, the log odds ratio considers both the frequency and relative uniqueness of words across groups, making it effective for comparing how strongly a word is associated with a particular genre.

Using the tidylo package, I computed the weighted log odds for each word by genre. The top 20 words with the highest log odds in each genre are visualized below.
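For intuition, a two-group weighted log odds in the style of Monroe, Colaresi, and Quinn can be sketched in base R: a prior pseudo-count is added to each word count, the log odds are compared across groups, and the difference is divided by its approximate standard deviation. This is an illustration of the idea, not a reproduction of tidylo's internals. The counts, the flat prior `alpha`, and the function name `weighted_log_odds()` are all made up for the example.

```r
# y_i, y_j: word counts in the two groups; alpha: prior pseudo-counts per word.
weighted_log_odds <- function(y_i, y_j, alpha) {
  n_i <- sum(y_i)
  n_j <- sum(y_j)
  alpha_0 <- sum(alpha)
  # Difference in smoothed log odds between the two groups.
  delta <- log((y_i + alpha) / (n_i + alpha_0 - y_i - alpha)) -
           log((y_j + alpha) / (n_j + alpha_0 - y_j - alpha))
  # Approximate variance of that difference.
  variance <- 1 / (y_i + alpha) + 1 / (y_j + alpha)
  delta / sqrt(variance)   # z-score: positive means "more typical of group i"
}

# Hypothetical counts for three words.
horror_n  <- c(werewolf = 12, love = 40,  real = 50)
romance_n <- c(werewolf = 0,  love = 160, real = 50)
prior     <- c(werewolf = 1,  love = 1,   real = 1)

z_horror <- weighted_log_odds(horror_n, romance_n, prior)
print(round(z_horror, 2))
```

Unlike the two-document TF-IDF, a word shared by both genres (such as "love" here) can still receive a strongly positive or negative score, because the comparison is based on relative rates and not on exclusivity.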

word_counts <- tokens %>%
  count(genre, word, sort = TRUE)

log_odds <- word_counts %>%
  bind_log_odds(set = genre, feature = word, n = n) %>%
  arrange(desc(log_odds_weighted))

top_log_odds <- log_odds %>%
  group_by(genre) %>%
  slice_max(log_odds_weighted, n = 20) %>%
  ungroup()

top_log_odds %>%
  mutate(word = reorder_within(word, log_odds_weighted, genre)) %>%
  ggplot(aes(x = word, y = log_odds_weighted, fill = genre)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~genre, scales = "free") +
  coord_flip() +
  labs(title = "Words Most Distinctive for Horror vs Romance (Log Odds Ratio)",
       x = "Word",
       y = "Weighted Log Odds Ratio") +
  theme_minimal()

The log odds ratio plot highlights words that are statistically distinctive to each genre rather than merely frequent. For Horror reviews, words like “werewolf,” “infected,” “zombi,” “killjoy,” and “rip” stand out, pointing to themes of death, supernatural elements, and violence. Names such as “freddy” and “jericho” may refer to iconic horror figures or settings.

In contrast, Romance reviews are characterized by emotionally resonant and interpersonal terms such as “love,” “grief,” “ashes,” and “gadget”—the latter possibly symbolizing personal items or metaphors in romantic storytelling. The presence of names like “sinatra,” “jefferson,” “astaire,” and “cagney” suggests references to classic romance figures or actors.

This method effectively surfaces genre-specific vocabulary, showing not only which words are common but which are disproportionately used in one genre over another.

5.5 Analysis and Figure 5: Co-occurrence Network of Words

To explore how words tend to appear together within the same document, I conducted a co-occurrence analysis using pairwise word counts. This analysis reveals thematic word clusters by identifying frequently co-appearing words, helping us understand how genre-specific vocabularies form semantic associations.

I focused on relatively frequent words (appearing at least 10 times) in each genre and visualized the top 100 co-occurring word pairs (with at least 5 co-occurrences) as networks.
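The pairwise count itself can be illustrated without any packages: build a document-by-word incidence matrix and multiply it by its own transpose. The sketch below uses made-up tokens. It also shows why each pair appears twice in a full pair list unless only one triangle of the symmetric matrix is kept, which is what I understand the `upper = FALSE` argument of `pairwise_count()` to do.

```r
# Made-up tokens: three short "reviews".
toy_tokens <- data.frame(
  doc_id = c(1, 1, 1, 2, 2, 3, 3),
  word   = c("blood", "gore", "dark", "blood", "gore", "love", "dark")
)

# Document x word incidence (1 if the word occurs in the document at all).
doc_word  <- unclass(table(toy_tokens$doc_id, toy_tokens$word))
incidence <- (doc_word > 0) * 1

# Word x word matrix: entry [a, b] = number of documents containing both a and b.
cooc <- crossprod(incidence)

# Keep one triangle so that every unordered pair is listed once.
pairs <- which(upper.tri(cooc) & cooc > 0, arr.ind = TRUE)
pair_counts <- data.frame(
  item1 = rownames(cooc)[pairs[, "row"]],
  item2 = colnames(cooc)[pairs[, "col"]],
  n     = cooc[pairs]
)
pair_counts <- pair_counts[order(-pair_counts$n), ]
print(pair_counts)
```

If both triangles were kept, a "top 100 pairs" cut would contain only about 50 distinct pairs, so the choice affects how dense the plotted network looks.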

# Calculate co-occurrence for Horror and Romance genres
horror_tokens <- tokens %>%
  filter(genre == "Horror") 

frequent_words_h <- horror_tokens %>%
  count(word) %>%
  filter(n >= 10) 

horror_corr <- horror_tokens %>%
  semi_join(frequent_words_h, by = "word") %>%
  pairwise_count(word, doc_id, sort = TRUE, upper = FALSE)

# Romance
romance_tokens <- tokens %>% 
  filter(genre == "Romance")

frequent_words_r <- romance_tokens %>%
  count(word) %>%
  filter(n >= 10) 

romance_cooc <- romance_tokens  %>%
  semi_join(frequent_words_r, by = "word") %>%
  pairwise_count(word, doc_id, sort = TRUE, upper = FALSE)  # count each pair once, as in the Horror block

# Plot function for co-occurrence graph
plot_cooc_graph <- function(cooc_df, genre_title, color_edge, color_node) {
  graph <- cooc_df %>%
    filter(n >= 5) %>%
    top_n(100, n) %>%
    graph_from_data_frame()

  set.seed(2025)
  ggraph(graph, layout = "fr") +
    geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = color_edge) +
    geom_node_point(size = 4, color = color_node) +
    geom_node_text(aes(label = name), repel = TRUE, size = 4) +
    theme_void() +
    labs(title = paste(genre_title, "Genre Word Co-occurrence Network"))
}

# Generate plots
plot_cooc_graph(horror_corr, "Horror", "darkred", "firebrick")

plot_cooc_graph(romance_cooc, "Romance", "darkblue", "steelblue")

The co-occurrence networks further illustrate the structural differences in language use across genres. In the Romance network, the word “love” acts as a central hub, frequently co-occurring with words like “beautiful,” “feel,” “friends,” and “perfect,” reinforcing themes of emotion, aesthetics, and social connection. This network reflects a cohesive, emotionally rich semantic structure.

Conversely, the Horror network, while still featuring “love” and “watching,” centers more around terms like “horror,” “blood,” “death,” “gore,” and “kill,” revealing a lexical web steeped in violence, tension, and grim visuals. The appearance of phrases like “low budget” and “special effects” also reflects audience commentary on production aspects typical of the horror genre.

Together, these visualizations show how language not only conveys content but also embodies the thematic essence of each genre. Horror emphasizes physicality and fear, while romance leans toward emotionality and beauty.

5.6 Analysis and Figure 6: Word Association Network Based on Phi Coefficient

To investigate the associative relationships between words in movie reviews, I conducted a word association analysis using the phi coefficient. First, I tokenized the reviews into individual words and assigned document IDs to each token. Then, I filtered for relatively frequent words that appeared at least 15 times across the dataset.

Using these filtered words, I calculated the phi correlation coefficient between pairs of words based on their co-occurrence within the same documents. I retained word pairs with a correlation greater than 0.4 to focus on strong associations.
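For reference, the phi coefficient of two words can be computed from the 2×2 table of documents that contain both words, only one of them, or neither. The base R sketch below uses made-up presence indicators. The function name `phi_coefficient()` is illustrative.

```r
# x, y: logical vectors, one entry per document (TRUE = word present).
phi_coefficient <- function(x, y) {
  n11 <- as.numeric(sum(x & y))     # both words present
  n10 <- as.numeric(sum(x & !y))    # only the first word
  n01 <- as.numeric(sum(!x & y))    # only the second word
  n00 <- as.numeric(sum(!x & !y))   # neither word
  (n11 * n00 - n10 * n01) /
    sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
}

# Hypothetical presence of two words across eight reviews.
has_low    <- c(TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE)
has_budget <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

phi <- phi_coefficient(has_low, has_budget)
print(phi)
```

Phi equals the Pearson correlation of the two 0/1 indicator vectors, so it rewards words that appear together and are absent together. Raw co-occurrence counts, by contrast, favor words that are simply frequent.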

# Prepare the data
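imdb_tidy <- df_sample %>%
  unnest_tokens(word, review) %>%          # doc_id from df_sample is carried onto every token
  add_count(word) %>%
  filter(n >= 15) %>%                      # keep words appearing at least 15 times overall
  pairwise_cor(item = word, feature = doc_id, sort = TRUE) %>%
  filter(correlation > 0.4)                # keep strongly associated pairs only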

# Create graph object
graph_phi <- imdb_tidy %>%
  as_tbl_graph(directed = FALSE) %>%
  mutate(
    centrality = centrality_degree(),
    group = as.factor(group_infomap()))

# Visualize the word association network
set.seed(123)
ggraph(graph_phi, layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), color = "gray50", show.legend = FALSE) +
  scale_edge_width(range = c(0.5, 3)) +
  geom_node_point(aes(size = centrality, color = group), show.legend = FALSE) +
  scale_size(range = c(3, 8)) +
  geom_node_text(aes(label = name), repel = TRUE, size = 5,color ="black") +
  theme_graph() +
  labs(title = "Network Graph Based on Phi Coefficient")

In this network, we observe distinct semantic clusters:

Genre-specific themes such as “zombie – zombies,” “witch – blair,” and “sci – fi” reflect strong co-association of conceptually related horror terms. The presence of “kung – fu” and “science – fiction” also points to subgenre blending, indicating how horror can intersect with action or speculative fiction.

A musical or classical film cluster is visible in pairs like “sinatra – gene,” “kelly – songs,” and “singing – episodes,” likely associated with reviews that mention old Hollywood or musical romances. These connections underscore the sentimental and nostalgic tone often present in Romance reviews.

The pair “low – budget” and its connection to “special – effects” suggests audience commentary around production value—typically more prevalent in Horror genre reviews, where budget is often a point of critique or interest.

This network demonstrates how the phi coefficient surfaces not only the most common pairings but the most cohesively associated ones within documents, offering a richer understanding of the latent themes and semantic coherence present in each genre.

5.7 Analysis and Figure 7: Sentiment Analysis Using NRC Lexicon

To compare the emotional characteristics of horror and romance movie reviews, I utilized the NRC sentiment lexicon, which classifies words into eight core emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

After joining the tokenized reviews with the NRC lexicon, I calculated the frequency of each sentiment category within the horror and romance genres. To ensure a fair comparison, the frequencies were normalized by the total word count of each genre.
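The normalization can be traced on a toy example in base R. The lexicon below is made up for illustration and is not the NRC lexicon. The point is that one word may carry several emotion labels, so joining tokens to the lexicon is a many-to-many match and the per-emotion shares need not sum to 1.

```r
# Made-up tokens and a made-up emotion lexicon (not NRC).
toy_tokens <- data.frame(
  genre = c("Horror", "Horror", "Horror", "Horror",
            "Romance", "Romance", "Romance", "Romance"),
  word  = c("ghost", "death", "night", "death",
            "kiss", "night", "wedding", "kiss")
)
toy_lexicon <- data.frame(
  word      = c("ghost", "death", "death", "kiss", "wedding", "wedding"),
  sentiment = c("fear", "fear", "sadness", "joy", "joy", "trust")
)

# Many-to-many match: each token is repeated once per emotion label of its word.
matched <- merge(toy_tokens, toy_lexicon, by = "word")

# Count matches per genre and emotion, then divide by all tokens in that genre.
emotion_counts <- as.data.frame(
  table(genre = matched$genre, sentiment = matched$sentiment),
  responseName = "count"
)
totals <- table(toy_tokens$genre)
emotion_counts$freq <- emotion_counts$count /
  as.numeric(totals[as.character(emotion_counts$genre)])

print(emotion_counts)
```

Dividing by all tokens in the genre, not only by matched tokens, means the shares also reflect how much of each genre's vocabulary is covered by the lexicon at all.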

# Load and filter NRC sentiment lexicon
nrc <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("anger", "anticipation", "disgust", "fear", 
                          "joy", "sadness", "surprise", "trust"))

# Match tokens with sentiments and calculate normalized frequencies
# A word can map to several emotions, so the join is declared many-to-many
sentiment_summary <- tokens %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  group_by(genre, sentiment) %>%
  summarise(count = n(), .groups = "drop") %>%
  left_join(
    tokens %>% group_by(genre) %>% summarise(total = n()),
    by = "genre"
  ) %>%
  mutate(freq = count / total)

# Visualize sentiment frequency comparison
ggplot(sentiment_summary, aes(x = sentiment, y = freq, fill = genre)) +
  geom_col(position = "dodge") +
  labs(title = "Comparison of NRC Sentiment Frequencies in Horror vs Romance Reviews",
       x = "Emotion Category",
       y = "Normalized Word Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The comparison of NRC sentiment frequencies between Horror and Romance genres reveals genre-specific emotional landscapes:

Horror reviews are strongly characterized by fear, anger, and disgust, with fear standing out as the most dominant emotion. This aligns with the core purpose of horror films—to evoke discomfort, suspense, and tension. The elevated presence of sadness also suggests that horror often engages themes of loss, trauma, or despair.

In contrast, Romance reviews exhibit higher levels of trust, joy, and anticipation, emotions typically associated with connection, optimism, and emotional intimacy. Trust is notably the most frequent emotion in Romance, reinforcing the genre’s emphasis on relationships, security, and emotional resolution.

Overall, the sentiment frequency distribution underscores how each genre elicits a distinct affective experience. Horror evokes negative high-arousal emotions, aligning with its unsettling themes, whereas Romance fosters positive and secure emotional states, reflecting its focus on love, hope, and human connection.

Conclusion

This project demonstrates how text mining and sentiment analysis can reveal genre-specific patterns in movie reviews. Horror and Romance films differ not only in plot but also in the lexical choices and emotional resonance reflected in viewer responses. The combination of frequency-based and relational analyses provides a holistic view of genre-driven discourse, offering valuable insights into how different genres evoke and structure language and emotion.