Executive summary

The rapid growth of digital media has made film access easy via OTT, leading to increased audience interaction through online reviews. Analyzing these intuitive, anonymous reviews is crucial for guiding future film production and marketing.

This report utilized 10,000 reviews from IMDb movie review data to analyze audience sentiment and linguistic patterns using text mining.

My analysis first identified overall positive and negative sentiment trends using word clouds and frequency charts. I then analyzed co-occurrence patterns for the top positive and negative keywords using Phi coefficients. Finally, I identified distinctive words for the comedy, horror, and fantasy genres using TF-IDF.

The analysis showed that positive reviews used more diverse language than negative ones. While both sentiments featured direct approval/disapproval keywords like ‘bad’ and ‘love’, positive reviews showed slightly higher frequencies for other associated terms. Phi coefficient analysis revealed that negative sentiments were often linked to characters or persons, whereas positive sentiments were more connected to film atmosphere or genre traits. Lastly, TF-IDF showed that comedy reviews distinctively mentioned famous actors, horror reviews cited directors or the genre itself, and fantasy reviews featured movie titles and character names. This highlights how audience engagement points vary across genres.

In conclusion, to foster positive audience reception, marketing should subtly emphasize a film’s atmosphere. Furthermore, leveraging specific actors, directors, or characters relevant to each genre’s distinct appeal is vital for effective engagement.

Data background

The original data for this analysis is the IMDb movie review dataset, imdb_movies_dataset.csv, provided by Aman Barthwal on Kaggle. This dataset provides 10,000 movie reviews, with each review containing the following key information:

Title: The movie’s title

Genre: The movie’s genre(s)

Review Title: The title of the review

Review: The detailed text content of the movie review

This dataset provides a vast amount of unstructured text data written by actual users.

Data loading, cleaning and preprocessing

I loaded the imdb_movies_dataset.csv file using the read_csv() function. To reduce unnecessary complexity, I selected only the essential columns: Title, Genre, Review Title, and Review. Additionally, a review_id column was added to each review using the row_number() function, which allowed me to track word-review relationships in subsequent analytical stages.

For text preprocessing, I replaced HTML line break tags such as <br /> within the review text with standard spaces. I then removed all special characters, retaining only alphabetic characters and spaces to ensure clean English text. Following this, each review text was tokenized into individual words. To filter out excessively long or potentially malformed words, I removed any words exceeding 10 characters. Furthermore, beyond the standard stop_words lexicon, I specifically removed the word ‘plot’, as it appears frequently in movie reviews but carries little sentiment information.

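# Packages used throughout the report
library(tidyverse)  # readr, dplyr, stringr, forcats, ggplot2
library(tidytext)   # unnest_tokens, stop_words, get_sentiments, bind_tf_idf
library(wordcloud)  # wordcloud
movie_review <- read_csv("imdb_movies_dataset.csv")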
## Rows: 10000 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Poster, Title, Certificate, Genre, Director, Cast, Description, Rev...
## dbl (4): Year, Duration (min), Rating, Metascore
## num (2): Votes, Review Count
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
custom_stop_words <- bind_rows(
  tibble(word = "plot", lexicon = "custom"), stop_words)

movie_tidy <- movie_review %>%
  select(Title, Genre, `Review Title`, Review) %>%
  mutate(review_id = row_number()) %>%
  mutate(Review = str_replace_all(Review, "<br\\s*/*>", " ")) %>% 
  mutate(Review = str_replace_all(Review, "[^a-zA-Z\\s]", "")) %>% 
  unnest_tokens(word, Review) %>% 
  filter(nchar(word) <= 10) %>%
  anti_join(custom_stop_words, by = "word")
movie_tidy
## # A tibble: 965,728 × 5
##    Title           Genre                  `Review Title`       review_id word   
##    <chr>           <chr>                  <chr>                    <int> <chr>  
##  1 The Idea of You Comedy, Drama, Romance Hypocrisy as an idea         1 film   
##  2 The Idea of You Comedy, Drama, Romance Hypocrisy as an idea         1 reacti…
##  3 The Idea of You Comedy, Drama, Romance Hypocrisy as an idea         1 wonder…
##  4 The Idea of You Comedy, Drama, Romance Hypocrisy as an idea         1 modern 
##  5 The Idea of You Comedy, Drama, Romance Hypocrisy as an idea         1 story  
##  6 The Idea of You Comedy, Drama, Romance Hypocrisy as an idea         1 adult  
##  7 The Idea of You Comedy, Drama, Romance Hypocrisy as an idea         1 woman  
##  8 The Idea of You Comedy, Drama, Romance Hypocrisy as an idea         1 attrac…
##  9 The Idea of You Comedy, Drama, Romance Hypocrisy as an idea         1 guy    
## 10 The Idea of You Comedy, Drama, Romance Hypocrisy as an idea         1 son    
## # ℹ 965,718 more rows
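
To make the effect of these cleaning steps concrete, the following base R sketch applies the same two regular expressions to a single made-up review string. It only approximates unnest_tokens() with tolower() and strsplit(), and the sample text and object names are assumptions for illustration, not part of the dataset.

```r
# Made-up review text (not from the dataset), containing HTML line breaks
sample_review <- "Great film!<br /><br />Don't miss it, unquestionably."

# Step 1: replace HTML line break tags with a space
step1 <- gsub("<br\\s*/*>", " ", sample_review, perl = TRUE)

# Step 2: keep only letters and whitespace
step2 <- gsub("[^a-zA-Z\\s]", "", step1, perl = TRUE)

# Step 3: rough stand-in for unnest_tokens(): lower-case and split on whitespace
tokens <- strsplit(tolower(trimws(step2)), "\\s+", perl = TRUE)[[1]]

# Step 4: drop words longer than 10 characters
tokens <- tokens[nchar(tokens) <= 10]

print(tokens)
```

In this sketch the contraction loses its apostrophe and becomes "dont", so it would no longer match an apostrophe-containing stop word such as "don't". This is consistent with ‘dont’ appearing among the frequent words in the genre output later in the report.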

Individual analysis and figures

1. Overall Positive/Negative Sentiment Trend Identification in Movie Reviews

To understand the trends in positive and negative movie review sentiments, I used the Bing lexicon. During the analysis, I noticed that the lexicon classifies ‘funny’ as a negative term, so I reclassified it as positive.

Upon generating separate word clouds for negative and positive sentiments, terms such as ‘bad’, ‘hard’, ‘wrong’, ‘dark’, and ‘death’ were prominent in the negative word cloud. For positive sentiments, words like ‘love’, ‘fun’, ‘funny’, ‘excellent’, and ‘pretty’ appeared frequently. Both sentiment categories prominently featured straightforward expressions of like or dislike, such as ‘bad’ and ‘love’. A closer look at word frequency, however, revealed a greater diversity of frequently used expressions among the positive terms beyond ‘love’. This is visible in the word clouds, where word size reflects relative frequency: the positive cloud contains more large words than the negative one. In the negative word cloud, ‘bad’ stands well apart from the rest, and the remaining terms appear with similar, lower frequencies.

#sentiment word cloud
bing_lexicon <- get_sentiments("bing")

# Reclassify 'funny' as positive; the Bing lexicon lists it as negative
custom_bing_lexicon <- bing_lexicon %>%
  mutate(sentiment = ifelse(word == "funny", "positive", sentiment))

imdb_sentiment_big <- movie_tidy %>%
  inner_join(custom_bing_lexicon, by = "word") %>% 
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%                   
  slice_max(n, n = 50) %>%      
  ungroup()
## Warning in inner_join(., custom_bing_lexicon, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 462945 of `x` matches multiple rows in `y`.
## ℹ Row 2822 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
imdb_sentiment_big
## # A tibble: 101 × 3
##    word   sentiment     n
##    <chr>  <chr>     <int>
##  1 bad    negative   2566
##  2 hard   negative   1207
##  3 wrong  negative    780
##  4 dark   negative    778
##  5 death  negative    698
##  6 lost   negative    668
##  7 worst  negative    633
##  8 dead   negative    582
##  9 boring negative    562
## 10 evil   negative    548
## # ℹ 91 more rows
#visualization

png(filename = "images/graph_cloud_negative.png",
    width = 10 * 150,
    height = 7 * 150,
    res = 150)

imdb_sentiment_big %>%
  filter(sentiment == "negative") %>%
  with(wordcloud(
    word,       
    n,          
    max.words = 50, 
    random.order = FALSE, 
    colors = "firebrick", 
    scale = c(4, 0.8) 
  ))
dev.off()
## png 
##   2
png(filename = "images/graph_cloud_positive.png",
    width = 10 * 150,
    height = 7 * 150,
    res = 150)
imdb_sentiment_big %>%
  filter(sentiment == "positive") %>%
  with(wordcloud(
    word,       
    n,          
    max.words = 50, 
    random.order = FALSE, 
    colors = "navy", 
    scale = c(4, 0.8) 
  ))
dev.off()        
## png 
##   2
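
The many-to-many warning printed by the join above indicates that at least one word occurs more than once in the lexicon, so each matching token is counted once per lexicon row. The sketch below reproduces the effect in base R with a tiny invented lexicon. The words, the duplicated entry, and all object names are hypothetical and chosen only for illustration.

```r
# Invented tokens and lexicon; 'mixed' is deliberately listed under both sentiments
tokens <- data.frame(word = c("love", "mixed", "bad"), stringsAsFactors = FALSE)
lexicon <- data.frame(
  word      = c("love", "bad", "mixed", "mixed"),
  sentiment = c("positive", "negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

# An inner join returns one row per matching (token, lexicon row) pair,
# so the single 'mixed' token appears twice in the result
joined <- merge(tokens, lexicon, by = "word")
print(nrow(joined))  # 4 rows from 3 tokens

# Words that occur more than once in the lexicon
dup_words <- unique(lexicon$word[duplicated(lexicon$word)])
print(dup_words)
```

Running the same duplicated() check on custom_bing_lexicon$word would show which entries trigger the warning in the real data. Whether to keep both rows or drop one is a judgment call.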

Next, I generated a frequency plot to identify the top three words for each sentiment, which are used in the next step.

#term frequency plot
imdb_sentiment <- movie_tidy %>%
  inner_join(custom_bing_lexicon, by = "word") %>% 
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%                   
  slice_max(n, n = 20) %>%      
  ungroup()
## Warning in inner_join(., custom_bing_lexicon, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 462945 of `x` matches multiple rows in `y`.
## ℹ Row 2822 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
imdb_sentiment
## # A tibble: 40 × 3
##    word   sentiment     n
##    <chr>  <chr>     <int>
##  1 bad    negative   2566
##  2 hard   negative   1207
##  3 wrong  negative    780
##  4 dark   negative    778
##  5 death  negative    698
##  6 lost   negative    668
##  7 worst  negative    633
##  8 dead   negative    582
##  9 boring negative    562
## 10 evil   negative    548
## # ℹ 30 more rows
graph_imdb_sentiment <- imdb_sentiment %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(
    title = "Sentiment of Movie review",
    x = "Term Frequency",
    y = NULL
  ) 
graph_imdb_sentiment

ggsave(filename = "images/graph_movie_sentiment.png",
       plot = graph_imdb_sentiment,                  
       width = 10,                                   
       height = 7,                                   
       dpi = 150)
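
The frequency gap among the negative terms can be checked with simple arithmetic on the ten counts printed in the output above. This sketch only re-enters those printed values by hand, and the object names are illustrative. No comparable figures are computed for the positive terms because their counts are not shown in the output.

```r
# Top 10 negative term counts, copied from the printed tibble
neg_top10 <- c(bad = 2566, hard = 1207, wrong = 780, dark = 778, death = 698,
               lost = 668, worst = 633, dead = 582, boring = 562, evil = 548)

# How many times more frequent 'bad' is than the runner-up 'hard'
gap_ratio <- neg_top10[["bad"]] / neg_top10[["hard"]]
print(round(gap_ratio, 2))  # 2.13

# Share of 'bad' within these ten terms
bad_share <- neg_top10[["bad"]] / sum(neg_top10)
print(round(bad_share, 2))  # 0.28
```

‘bad’ is roughly twice as frequent as the second-ranked term, while the remaining nine counts fall in a much narrower range.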

3. Identification of Characteristic Words by Major Genres (Comedy, Horror, Fantasy)

To pinpoint words characteristic of specific genres, I randomly selected three genres, comedy, horror, and fantasy, because most films list multiple genres, which made it challenging to assign every film to a single category. A review was assigned to a genre whenever that genre appeared in the film’s Genre field, so a film listing several of the chosen genres contributes to each of them. I then added a single genre column and treated the pooled reviews of each genre as one document for the TF-IDF calculation.

The analysis of characteristic words revealed distinct patterns across genres. In comedy, tokens such as ‘dem’, ‘hart’, and ‘hepburn’ ranked highest, several of which are likely the surnames of well-known actors. For horror, director names such as ‘bava’ and ‘fulci’, along with film titles like ‘hellraiser’ and ‘videodrome’, showed high prominence. Finally, in fantasy, titles and characters including ‘conan’, ‘krull’, and ‘sparrow’ were notably prominent. This understanding helps us see what specific elements engage audiences within each genre.
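
Because each genre is treated as one document, the inverse document frequency can take only three values here. The sketch below, in base R with invented counts, follows the usual definition that bind_tf_idf() implements: tf is a word's share of all words in its document, and idf is the natural log of the number of documents divided by the number of documents containing the word. The counts and object names are assumptions for illustration only.

```r
# Invented word counts for three genre "documents"
counts <- data.frame(
  word  = c("film", "film", "film", "pixar", "pixar", "bava"),
  genre = c("Comedy", "Horror", "Fantasy", "Comedy", "Fantasy", "Horror"),
  n     = c(50, 40, 30, 5, 2, 3),
  stringsAsFactors = FALSE
)

n_docs    <- length(unique(counts$genre))         # 3 genre documents
doc_total <- tapply(counts$n, counts$genre, sum)  # total words per genre
doc_freq  <- table(counts$word)                   # genres containing each word

# Term frequency, inverse document frequency, and their product
counts$tf     <- counts$n / as.numeric(doc_total[counts$genre])
counts$idf    <- log(n_docs / as.numeric(doc_freq[counts$word]))
counts$tf_idf <- counts$tf * counts$idf

print(counts)
```

With three genre documents, a word found in only one genre gets idf = ln(3) ≈ 1.10, a word found in two gets ln(3/2) ≈ 0.405, and a word found in all three gets 0. These match the two idf values visible in the report's output. Common words that occur in all three genres therefore receive a TF-IDF of zero and cannot rank among the distinctive terms.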

#filtering Genre
comedy_tidy <- movie_tidy %>%
  filter(str_detect(Genre, "Comedy"))

horror_tidy <- movie_tidy %>%
  filter(str_detect(Genre, "Horror"))

fantasy_tidy <- movie_tidy %>%
  filter(str_detect(Genre, "Fantasy"))


#word count
comedy_word <- comedy_tidy %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  mutate(genre_category = "Comedy")

horror_word <- horror_tidy %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  mutate(genre_category = "Horror")

fantasy_word <- fantasy_tidy %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  mutate(genre_category = "Fantasy")

target_genre <- bind_rows(
  comedy_word, horror_word, fantasy_word)
target_genre
## # A tibble: 68,446 × 3
##    word           n genre_category
##    <chr>      <int> <chr>         
##  1 movie       6832 Comedy        
##  2 film        5990 Comedy        
##  3 time        1671 Comedy        
##  4 story       1539 Comedy        
##  5 people      1464 Comedy        
##  6 movies      1455 Comedy        
##  7 funny       1359 Comedy        
##  8 films       1318 Comedy        
##  9 dont        1299 Comedy        
## 10 characters  1283 Comedy        
## # ℹ 68,436 more rows
genre_tf_idf_specific <- target_genre %>%
  bind_tf_idf(term = word,
              document = genre_category,
              n = n) %>%
  arrange(desc(tf_idf))

genre_tf_idf_specific
## # A tibble: 68,446 × 6
##    word           n genre_category       tf   idf   tf_idf
##    <chr>      <int> <chr>             <dbl> <dbl>    <dbl>
##  1 bava          26 Horror         0.000182 1.10  0.000200
##  2 maud          22 Horror         0.000154 1.10  0.000169
##  3 conan         31 Fantasy        0.000411 0.405 0.000167
##  4 dem           41 Comedy         0.000141 1.10  0.000155
##  5 hellraiser    20 Horror         0.000140 1.10  0.000154
##  6 hart          40 Comedy         0.000137 1.10  0.000151
##  7 excalibur     10 Fantasy        0.000133 1.10  0.000146
##  8 krull         10 Fantasy        0.000133 1.10  0.000146
##  9 videodrome    18 Horror         0.000126 1.10  0.000138
## 10 leia           9 Fantasy        0.000119 1.10  0.000131
## # ℹ 68,436 more rows
top_distinctive_word <- genre_tf_idf_specific %>%
  group_by(genre_category) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup()
top_distinctive_word
## # A tibble: 30 × 6
##    word          n genre_category        tf   idf    tf_idf
##    <chr>     <int> <chr>              <dbl> <dbl>     <dbl>
##  1 dem          41 Comedy         0.000141  1.10  0.000155 
##  2 hart         40 Comedy         0.000137  1.10  0.000151 
##  3 hepburn      34 Comedy         0.000117  1.10  0.000128 
##  4 gwaan        32 Comedy         0.000110  1.10  0.000121 
##  5 celebrity    26 Comedy         0.0000892 1.10  0.0000980
##  6 chaplin      26 Comedy         0.0000892 1.10  0.0000980
##  7 drunken      26 Comedy         0.0000892 1.10  0.0000980
##  8 superbad     25 Comedy         0.0000858 1.10  0.0000943
##  9 bernie       24 Comedy         0.0000824 1.10  0.0000905
## 10 pixar        62 Comedy         0.000213  0.405 0.0000863
## # ℹ 20 more rows
#visualization
graph_genre_tf_idf <- top_distinctive_word %>%
  mutate(genre_category = factor(genre_category, levels = c("Comedy", "Horror", "Fantasy"))) %>%
  ggplot(aes(x = reorder_within(word, tf_idf, genre_category),
             y = tf_idf,
             fill = genre_category)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ genre_category, scales = "free_y", ncol = 3) + 
  coord_flip() + 
  scale_x_reordered() + 
  scale_y_continuous(labels = scales::scientific_format(digits = 2)) +
  labs(
    title = "Top Distinctive Words by Specific Movie Genre Categories (TF-IDF)",
    x = NULL,
    y = "TF-IDF"
  )

graph_genre_tf_idf

ggsave(filename = "images/graph_genre_tfidf.png",
       plot = graph_genre_tf_idf,                  
       width = 10,                                   
       height = 7,                                   
       dpi = 150)

Conclusion

To attract more positive reviews, a subtle emotional approach is important, given the diverse language used in positive reviews. The co-occurrence of sentiment words suggests that positive reactions tend to arise when a film’s atmosphere and genre traits connect with the audience. To reduce negative reviews, closer attention to character development would help, since negative sentiment was often linked to characters and persons. Lastly, what audiences pay attention to varies by genre: actors, directors, or characters. Movie promotions should therefore be intuitive and specific to each genre.

In short, successful positive audience engagement requires careful marketing that highlights the film’s atmosphere, and it’s key to remember that audience priorities change with the movie genre.