Executive summary

IMDB stands for ‘Internet Movie Database’. The site hosts user reviews and ratings for movies and other content, and ranks titles accordingly.

As more viewers watch movies and TV series through OTT services such as Netflix, analyzing viewers’ reactions, in the form of reviews, has become an important part of content production.

My main question is therefore: which sentiment words are used most frequently in positive versus negative movie reviews on IMDB? How do word associations differ by sentiment, and what insights do they offer for content creation?

My prediction is that positive reviews will contain more words related to personal satisfaction (e.g., “amazing”, “joyful”, “love”), while negative reviews will use more words associated with disappointment or criticism (e.g., “boring”, “worst”, “hate”).

I also expect the co-occurrence network of positive reviews to show more emotionally cohesive, mutually reinforcing terms, while the network of negative reviews will show more diverse or fragmented clusters reflecting different types of criticism.

This project explores the sentiment polarity of IMDB movie reviews to uncover linguistic patterns that distinguish positive from negative feedback. Understanding these patterns can improve automated sentiment detection in recommendation systems and media analysis.

Data background

This analysis uses the ‘IMDB Data set of 50K Movie Reviews’ hosted on kaggle.com. All reviews are written in English, and the data set contains 50,000 reviews in total.

The data set consists of only two columns, ‘review’ and ‘sentiment’. The ‘review’ column contains the full text of each user-written movie review, and the ‘sentiment’ column indicates whether the review is positive or negative.

Data loading, cleaning and preprocessing

First, the original data is imported into ‘movie_review’. Then, in ‘review_clean’, the ‘review’ column is tokenized into words and stop words are removed. Numbers and ‘br’, a leftover from the HTML line-break tag, are also filtered out. Finally, the ‘sentiment’ column is renamed to ‘sentiment_entire’ to distinguish it from the Bing and NRC sentiment columns.

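#Packages used in this chunk (later chunks also need ggplot2, tidyr, reshape2,
#wordcloud, widyr, tidygraph, ggraph, and igraph)
library(dplyr)
library(tidytext)
library(stringr)

movie_review <- read.csv("IMDB Dataset.csv")

review_clean <- movie_review %>% 
  unnest_tokens(word, review) %>% 
  anti_join(stop_words) %>% 
  filter(word != "br") %>%  #Remove the HTML tag residue 'br'
  filter(!str_detect(word, "^[0-9]+$")) %>% #Remove numbers
  rename(sentiment_entire = sentiment) #Distinguish sentiment for entire review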

## Joining with `by = join_by(word)`
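
The effect of each preprocessing step can be checked on a single made-up review before running it on all 50,000. The sketch below is only an illustration: ‘toy_review’ and its text are hypothetical and not taken from the data set, and the join key is written out explicitly to avoid the join message.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Hypothetical one-row stand-in for 'movie_review' (not from the data set)
toy_review <- data.frame(
  review = "A moving film.<br /><br />I watched it 3 times.",
  sentiment = "positive",
  stringsAsFactors = FALSE
)

toy_clean <- toy_review %>%
  unnest_tokens(word, review) %>%            # lowercase, one token per row
  anti_join(stop_words, by = "word") %>%     # drop common function words
  filter(word != "br") %>%                   # drop the HTML line-break residue
  filter(!str_detect(word, "^[0-9]+$")) %>%  # drop purely numeric tokens
  rename(sentiment_entire = sentiment)

toy_clean$word
```

Printing ‘toy_clean$word’ before and after each filter is a quick way to confirm that no ‘br’ or number tokens survive.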

Text data analysis

Individual analysis and figures

Analysis and Figure 1

The goal is to find the ten sentiment words with the highest TF-IDF values in positive and in negative reviews.

First, the TF-IDF value of every token in the reviews is calculated, and sentiment-related words are then filtered using the Bing lexicon. Finally, the ten sentiment words with the highest TF-IDF are extracted for each review sentiment and shown as a bar graph.

#data = pre-processed data (tokenized, removed stop_words, html code, and numbers)

#Find the frequency of words
frequency <- review_clean %>%
  count(sentiment_entire, word, sort = T)

#Find the TF-IDF
frequency <- frequency %>% 
  bind_tf_idf(term = word,
              document = sentiment_entire,
              n = n) %>% 
  arrange(-tf_idf)
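
One property of this setup is worth keeping in mind when reading the results. Because ‘sentiment_entire’ is used as the document, the corpus has only two documents. With idf = ln(number of documents / number of documents containing the term), a word that occurs in both classes gets idf = ln(2/2) = 0, so only words exclusive to one class receive a non-zero TF-IDF. The sketch below illustrates this with hypothetical counts; the words and numbers are assumptions chosen for the example.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts; in the report these come from count(sentiment_entire, word)
toy_counts <- tibble::tribble(
  ~sentiment_entire, ~word,        ~n,
  "positive",        "story",       3,
  "positive",        "acting",      2,
  "positive",        "enchant",     1,
  "negative",        "story",       2,
  "negative",        "acting",      3,
  "negative",        "wretchedly",  1
)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(term = word, document = sentiment_entire, n = n) %>%
  arrange(-tf_idf)

# Shared words ('story', 'acting') score 0; class-exclusive words score tf * ln(2)
toy_tf_idf
```

This is one plausible reason why the top-ranked words tend to be rare, class-exclusive terms rather than the most common sentiment words.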

When the top 10 TF-IDF values are extracted without joining the Bing lexicon, the results are dominated by proper nouns such as movie titles and character names, so I decided to filter with the Bing lexicon before creating the bar graph.

#Key word extraction with Bing lexicons
top10 <- frequency %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  rename(sentiment_bing = sentiment) %>% #To clarify the distinction
  group_by(sentiment_entire) %>% 
  slice_max(tf_idf, n = 10, with_ties = F)

## Warning in inner_join(., get_sentiments("bing"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 119387 of `x` matches multiple rows in `y`.
## ℹ Row 4621 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
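
The warning above indicates that at least one word in the frequency table matched more than one row of the lexicon, which happens when a lexicon lists the same word under two labels. The sketch below shows, on a hypothetical mini-lexicon, how such words can be found and two ways of handling them. ‘toy_lexicon’ and ‘toy_freq’ are made up for illustration; the same check could be run on get_sentiments("bing").

```r
library(dplyr)

# Hypothetical mini-lexicon: "envious" is listed under both labels
toy_lexicon <- tibble::tibble(
  word      = c("love", "boring", "envious", "envious"),
  sentiment = c("positive", "negative", "positive", "negative")
)
# Hypothetical frequency table with one row per review sentiment and word
toy_freq <- tibble::tibble(
  sentiment_entire = c("positive", "negative", "positive"),
  word             = c("love", "love", "envious"),
  n                = c(10, 4, 2)
)

# Words that carry more than one label in the lexicon
ambiguous <- toy_lexicon %>% count(word) %>% filter(n > 1)

# Option 1: drop ambiguous words before joining
joined_unique <- toy_freq %>%
  inner_join(anti_join(toy_lexicon, ambiguous, by = "word"), by = "word")

# Option 2: keep them and state the relationship explicitly
joined_all <- toy_freq %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")
```

Option 1 avoids counting the same word under both polarities, while Option 2 keeps the original behavior and only silences the warning.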

#Create a bar graph
ggplot(top10, aes(x= reorder_within(word, tf_idf, sentiment_entire),
                     y= tf_idf,
                     fill = sentiment_entire))+
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~sentiment_entire, scales = "free") +
  scale_x_reordered()+
  labs(title = "Top 10 TF-IDF Words per Review Sentiment",
       x= NULL)+
  scale_fill_manual(values = c("negative" = "lightblue",
                               "positive" = "yellow"))+
  theme_light()

This graph shows the sentiment words with the highest TF-IDF values within reviews whose overall sentiment (sentiment_entire) is ‘positive’ or ‘negative’.

In positive reviews, words that carry emotionally positive nuances or impressions, such as ‘openness’, ‘contentment’, ‘enchant’, and ‘immaculately’, rank at the top. In negative reviews, on the other hand, words that describe negative emotions or experiences, such as ‘uncreative’, ‘wretchedly’, and ‘incoherently’, are prominent.

However, words with a negative meaning, such as “uneasiness”, can also appear in positive reviews, even though the word is negative when taken on its own.

For instance, in the following review:

“The plot is cleverly enveloped in the Cuban Missile Crisis… and a general mist of uneasiness and fear in the air…”

Here, “uneasiness” is not a complaint about the movie but a description of its background atmosphere.

As such, TF-IDF is good at finding words that are characteristic of reviews with a given sentiment, but those words do not always express sentiment directly. Nonetheless, TF-IDF is useful for understanding how language use varies with sentiment, and visualizing the values as bar graphs makes the differences easy to grasp.
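
A context check like the one for “uneasiness” can be done by pulling a short window of text around a keyword. The following is a minimal sketch, assuming a data frame with ‘review’ and ‘sentiment’ columns like ‘movie_review’; ‘toy_reviews’ and its texts are hypothetical stand-ins.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for 'movie_review'
toy_reviews <- tibble::tibble(
  review = c("A general mist of uneasiness and fear hangs in the air.",
             "Highly recommended."),
  sentiment = c("positive", "positive")
)

keyword <- "uneasiness"

context <- toy_reviews %>%
  # keep only reviews that contain the keyword
  filter(str_detect(review, fixed(keyword, ignore_case = TRUE))) %>%
  # extract up to 30 characters on each side of the keyword
  mutate(snippet = str_extract(
    review,
    regex(paste0(".{0,30}", keyword, ".{0,30}"), ignore_case = TRUE)
  )) %>%
  select(sentiment, snippet)

context
```

Reading a handful of such snippets per top-ranked word helps separate words that judge the movie from words that merely describe its content.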

Analysis and Figure 2-1

Now let’s look at the word cloud for each sentiment.

This visualization combines the sentiment of the entire review (sentiment_entire) with word-level sentiment (sentiment_bing), showing cases where negative words appear in positive reviews and positive words appear in negative reviews. It confirms that sentiments can be mixed or expressed in combination within a single review.

pos_review <- review_clean %>%
  inner_join(get_sentiments("bing"), by = "word") %>% 
  rename(sentiment_bing = sentiment) %>% 
  filter(sentiment_entire == "positive") %>%
  #Filter only positive reviews in the context of the entire review
  count(sentiment_bing, word, sort = TRUE)

## Warning in inner_join(., get_sentiments("bing"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1194016 of `x` matches multiple rows in `y`.
## ℹ Row 5781 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

word_cloud1 <- pos_review %>%
  acast(word ~ sentiment_bing, value.var = "n", fill = 0)

#Create a comparison word cloud
suppressWarnings(
comparison.cloud(word_cloud1,
                 colors = c("lightblue", "red"), 
                 #negative: light blue, positive: red
                 max.words = 100,
                 random.order = FALSE,
                 title.size = 1.5)
)

neg_review <- review_clean %>%
  inner_join(get_sentiments("bing"), by = "word") %>% 
  rename(sentiment_bing = sentiment) %>% 
  filter(sentiment_entire == "negative") %>%
  #Filter only negative reviews in the context of the entire review
  count(sentiment_bing, word, sort = TRUE)

## Warning in inner_join(., get_sentiments("bing"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1194016 of `x` matches multiple rows in `y`.
## ℹ Row 5781 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

word_cloud2 <- neg_review %>%
  acast(word ~ sentiment_bing, value.var = "n", fill = 0)

#Create a comparison word cloud
suppressWarnings(
comparison.cloud(word_cloud2,
                 colors = c("lightblue", "red"), 
                 #negative: light blue, positive: red
                 max.words = 100,
                 random.order = FALSE,
                 title.size = 1.5)
)

Indeed, negative words such as ‘bad’ and ‘hard’ appeared in positive reviews, and positive words such as ‘love’ and ‘pretty’ appeared in negative reviews. Even a positive review therefore contains not only positive words but also negative ones, which are used to describe emotional shifts or the plot.
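
How large this mixing is can be expressed as a share rather than read off the cloud. The sketch below computes, for each review sentiment, the proportion of Bing-positive and Bing-negative tokens. ‘toy_tokens’ is a hypothetical table shaped like ‘review_clean’ after the Bing join, and its values are made up.

```r
library(dplyr)

# Hypothetical token table: one row per sentiment-bearing token
toy_tokens <- tibble::tibble(
  sentiment_entire = c("positive", "positive", "positive", "positive",
                       "negative", "negative", "negative", "negative"),
  sentiment_bing   = c("positive", "positive", "positive", "negative",
                       "negative", "negative", "positive", "negative")
)

mix_share <- toy_tokens %>%
  count(sentiment_entire, sentiment_bing) %>%
  group_by(sentiment_entire) %>%
  mutate(share = n / sum(n)) %>%  # share within each review sentiment
  ungroup()

mix_share
```

Applied to the real data, the ‘share’ column would state directly what fraction of sentiment words in positive reviews are negative, and vice versa.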

Analysis and Figure 2-2

Although a word cloud is visually intuitive, it is difficult to judge the relative weight or distribution of sentiments numerically. The distribution of sentiments was therefore visualized as bar graphs using the NRC emotion lexicon.

#data = pre-processed data (tokenized, removed stop_words, html code, and numbers)
#load the NRC lexicons
load("nrc.rda")

#Bind with NRC lexicons
nrc_pos <- review_clean %>% 
  inner_join(nrc, by = "word") %>%
  rename(sentiment_nrc = sentiment) %>% #To clarify the distinction
  filter(sentiment_entire == "positive") %>% #Filtering only positive reviews
  count(sentiment_nrc, sort = T)

## Warning in inner_join(., nrc, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 10 of `x` matches multiple rows in `y`.
## ℹ Row 13389 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

#Percentage calculation
nrc_prop1 <- nrc_pos %>%
  group_by(sentiment_nrc) %>%
  summarise(count = sum(n)) %>%
  mutate(prop = count / sum(count),
         label = paste0(round(prop * 100, 1), "%")) 

#Make a distribution graph
ggplot(nrc_prop1, aes(x = reorder(sentiment_nrc, prop),
                      y = prop,
                      fill = sentiment_nrc)) +
  
  geom_col(width = 0.6) +
  
  geom_text(aes(label = label), hjust = 1.1, color = "white") +
  coord_flip() + 
  
  labs(title = "Sentimental distribution in positive reviews",
       x = "Sentiment",
       y = "Percentage") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

nrc_neg <- review_clean %>% 
  inner_join(nrc, by = "word") %>%
  rename(sentiment_nrc = sentiment) %>% 
  filter(sentiment_entire == "negative") %>% #Filtering only negative reviews
  count(sentiment_nrc, sort = T)

## Warning in inner_join(., nrc, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 10 of `x` matches multiple rows in `y`.
## ℹ Row 13389 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

#Percentage calculation
nrc_prop2 <- nrc_neg %>%
  group_by(sentiment_nrc) %>%
  summarise(count = sum(n)) %>%
  mutate(prop = count / sum(count),
         label = paste0(round(prop * 100, 1), "%")) 

#Make a distribution graph
ggplot(nrc_prop2, aes(x = reorder(sentiment_nrc, prop),
                      y = prop,
                      fill = sentiment_nrc)) +
  
  geom_col(width = 0.6) +
  
  geom_text(aes(label = label), hjust = 1.1, color = "white") +
  coord_flip() + 
  
  labs(title = "Sentimental distribution in negative reviews",
       x = "Sentiment",
       y = "Percentage") +
  theme_minimal(base_size = 12)+
  theme(legend.position = "none")

This analysis shows more clearly that sentiments work in a complex way that goes beyond the two categories of positive and negative.

Sentiment analysis of this kind is useful for grasping the multi-layered components of sentiment. However, dictionary-based approaches such as NRC and Bing must be interpreted with care, because they assign each word a fixed label without regard to context. For example, neutral words such as ‘plot’ can be classified as negative, so the context of the text must be considered when interpreting the results.
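
The many-to-many warning raised by the NRC join has a practical consequence for these percentages: one word can carry several NRC labels, so the joined table has more rows than matched tokens, and the proportions describe word-label pairs rather than tokens. The sketch below illustrates this with a hypothetical mini-lexicon in the same word/sentiment layout; the words and labels are assumptions for the example.

```r
library(dplyr)

# Hypothetical mini-lexicon: each word carries two labels
toy_nrc <- tibble::tibble(
  word      = c("love", "love", "ghost", "ghost"),
  sentiment = c("joy", "positive", "fear", "negative")
)
# Hypothetical tokens from one review
toy_tokens <- tibble::tibble(word = c("love", "ghost", "love", "camera"))

toy_joined <- toy_tokens %>%
  inner_join(toy_nrc, by = "word", relationship = "many-to-many")

n_matched_tokens <- sum(toy_tokens$word %in% toy_nrc$word)  # tokens found in the lexicon
n_label_rows <- nrow(toy_joined)                            # rows after the join

toy_prop <- toy_joined %>%
  count(sentiment, name = "count") %>%
  mutate(prop = count / sum(count))
```

Here three matched tokens become six rows, so a word with many labels weighs more heavily in the distribution than a word with one.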

Analysis and Figure 3-1

The purpose of this analysis is to understand how sentiment expressions are organized and connected within positive and negative reviews by visualizing the co-occurrence network of sentiment words extracted with the Bing lexicon.

# Each review needs a unique document ID to build a co-occurrence network, so the original 'movie_review' data is processed again with a 'review_id' column added.

id_review <- movie_review %>% 
  mutate(review_id = row_number()) %>% #Add review_id
  unnest_tokens(word, review) %>% 
  anti_join(stop_words) %>% 
  filter(!word == "br") %>% 
  filter(!str_detect(word, "^[0-9]+$")) %>% 
  rename(sentiment_entire = sentiment)

## Joining with `by = join_by(word)`

#Filtering only sentimental words in Positive Reviews
pos_words <- id_review %>%
  filter(sentiment_entire == "positive") %>%
  inner_join(get_sentiments("bing"), by = "word")

## Warning in inner_join(., get_sentiments("bing"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 600352 of `x` matches multiple rows in `y`.
## ℹ Row 5781 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

#Sentimental word pair extraction by review ID
word_pairs_pos <- pos_words %>%
  pairwise_count(item = word,
                 feature = review_id,
                 sort = TRUE, upper = FALSE)

#Network Visualization
set.seed(1234)

graph_review1 <- word_pairs_pos %>%
  filter(n >= 250) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree()) #Expressing network as connection centrality

set.seed(1234)  

ggraph(graph_review1, layout = "fr") +
  geom_edge_link(color = "gray50",
                 alpha = 0.5) +
  geom_node_point(aes(size = centrality),
                  color = "red", 
                  show.legend = FALSE) +
  
  scale_size(range = c(5, 10)) +
  
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  
  theme_graph() +
  labs(title = "Co-occurrence Network of Sentimental Words-Positive")

#Filtering only sentimental words in Negative Reviews
neg_words <- id_review %>%
  filter(sentiment_entire == "negative") %>%
  inner_join(get_sentiments("bing"), by = "word")

## Warning in inner_join(., get_sentiments("bing"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 680443 of `x` matches multiple rows in `y`.
## ℹ Row 6786 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

#Sentimental word pair extraction by review ID
word_pairs_neg <- neg_words %>%
  pairwise_count(item = word,
                 feature = review_id,
                 sort = TRUE, upper = FALSE)

#Network visualization
set.seed(1234)

graph_review2 <- word_pairs_neg %>%
  filter(n >= 300) %>% #Higher cutoff than the positive network (n >= 250)
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree())

set.seed(1234)  

ggraph(graph_review2, layout = "fr") +
  geom_edge_link(color = "gray50",
                 alpha = 0.5) +
  geom_node_point(aes(size = centrality),
                  color = "lightblue",
                  show.legend = FALSE) +
  
  scale_size(range = c(5, 10)) +
  
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  
  theme_graph() +
  labs(title = "Co-occurrence Network of Sentimental Words-Negative")

With the cutoffs used (n >= 250 for positive reviews, n >= 300 for negative reviews), positive reviews showed a structure concentrated around a few strong positive words, while negative reviews showed a more widely distributed structure connecting a variety of sentiments. Because the cutoffs differ, the comparison between the two networks is qualitative.

These results suggest that sentiment is expressed through a connected network of words rather than through isolated words within a review, and that future sentiment analysis models should take the position and interaction of words into account.
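
What the edge weights in these networks mean can be seen on a small example: pairwise_count() counts the number of reviews in which two words both occur, so repeating a word within one review does not increase the count. The sketch below uses a hypothetical ‘toy_words’ table with made-up review IDs and words.

```r
library(dplyr)
library(widyr)

# Hypothetical sentiment tokens from three reviews
toy_words <- tibble::tibble(
  review_id = c(1, 1, 1, 2, 2, 3, 3),
  word      = c("love", "beautiful", "love", "love", "beautiful", "love", "bad")
)

toy_pairs <- toy_words %>%
  pairwise_count(item = word,
                 feature = review_id,
                 sort = TRUE, upper = FALSE)  # upper = FALSE keeps each pair once

# 'love' and 'beautiful' share two reviews; 'love' and 'bad' share one
toy_pairs
```

A cutoff such as n >= 250 therefore keeps only word pairs that appear together in at least 250 different reviews.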

Analysis and Figure 3-2

In this analysis, a bi-gram network was built from the entire review text without filtering by the Bing lexicon. This approach is not limited to sentiment words and can better reflect natural language use and contextual connections.

# The bi-gram network starts again from the original 'movie_review' data, because bi-gram tokenization needs the raw, untokenized text.

#Bi-gram data processing
review_pos_bigram <- movie_review %>%
  mutate(review_id = row_number()) %>% 
  filter(sentiment == "positive") %>% 
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

#Dividing each bi-gram into word1 and word2 based on space
review_pos_seperated <- review_pos_bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

#Removing stop words, numbers, and html code
pos_bigrams_filtered <- review_pos_seperated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  filter(!str_detect(word1, "^\\d+$"),
         !str_detect(word2, "^\\d+$")) %>%
  filter(word1 != "br", word2 != "br") 

pos_bigrams_counted <- pos_bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

pos_bigrams_graph <- pos_bigrams_counted %>%
  filter(n > 200) %>% #To increase network graph visibility
  graph_from_data_frame()

set.seed(123)
ggraph(pos_bigrams_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), edge_colour = "gray50") +
  geom_node_point(color = "tomato", size = 2) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1, repel = TRUE) +
  theme_void() +
  labs(title = "Bigram Network of Positive Reviews")

#Bi-gram data processing
review_neg_bigram <- movie_review %>%
  mutate(review_id = row_number()) %>% 
  filter(sentiment == "negative") %>% 
  unnest_tokens(bigram, review , token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

review_neg_seperated <- review_neg_bigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

neg_bigrams_filtered <- review_neg_seperated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  filter(!str_detect(word1, "^\\d+$"),
         !str_detect(word2, "^\\d+$")) %>%
  filter(word1 != "br", word2 != "br") 

neg_bigrams_counted <- neg_bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

neg_bigrams_graph <- neg_bigrams_counted %>%
  filter(n > 200) %>%
  graph_from_data_frame()

set.seed(123)
ggraph(neg_bigrams_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), edge_colour = "gray50") +
  geom_node_point(color = "lightblue", size = 2) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1, repel = TRUE) +
  theme_void() +
  labs(title = "Bigram Network of Negative Reviews")

In positive reviews, bi-grams such as ‘romantic comedy’ and ‘highly recommended’ were common, while in negative reviews, terms such as ‘worst’, ‘bad’, and ‘plot holes’ stood out.

This gave a clearer picture of the trends and topics in the word pairs frequently used in positive and negative reviews.

In particular, using bi-grams rather than single words made it possible to see more realistically in what context and how sentiment is expressed.

Furthermore, these results may help capture the characteristics of actual language use when designing a sentiment analysis model that reflects sentence structure or lexical context in the future.
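
The order of the bi-gram steps matters: pairs are formed from the raw text first, and pairs containing a stop word are removed afterwards, so only words that were truly adjacent in the review end up connected. The sketch below walks through this on one hypothetical sentence; ‘toy_review’ is made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-sentence review
toy_review <- tibble::tibble(
  review = "This romantic comedy is highly recommended."
)

toy_bigrams <- toy_review %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%  # adjacent word pairs
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,                         # drop pairs with a stop word
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

toy_bigrams
```

If stop words were removed before forming pairs, ‘comedy’ and ‘highly’ would become neighbors even though they never appear side by side in the text.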

Conclusion

Based on the IMDB movie review data, this analysis examined the characteristics of sentiment expression by comparing the sentiment words that frequently appear in positive and negative reviews and the associations among them.

This study suggests that content creation and the development of sentiment analysis models should consider the complexity of context and sentiment, going beyond a simple positive/negative dichotomy. It can also provide insights into the sentiment factors that induce positive reactions.

ATA_Final_Report

Yeonjae-Lee

6/17