title: "From Plot to Review: A Study of Positive and Negative Audience Evaluations Triggered by Movie Plots"
author: "채의청"
date: "2024/6/18"
output:
  word_document: default
  html_document:
    code_folding: hide
    fig_caption: true

# Libraries and general settings
library(tidyverse)   # attaches readr, dplyr, stringr, ggplot2, and tidyr
library(tidytext)
library(wordcloud)
library(widyr)


Executive summary

What is (are) your main question(s)?
What is your story?
What does the final graphic show?

Q1: What types of movie plots trigger positive or negative reviews from consumers?
Q2: When assessing a film, what other components do viewers typically highlight when discussing the plot?

K-content is enormously popular worldwide, and at its core is the content itself.
Films like and won consecutive awards at the 2020 and 2021 Academy Awards.
While the technical aspect of filmmaking is important, the content remains the key competitive edge by captivating audiences.
Not only in Korea but also globally, the film industry strives to produce quality content, researching what content can attract and lead viewers.
Content is structured with elements such as story, direction, and music, with the plot in the story being the fundamental and significant component that shapes events in a film.

Consumers typically reduce the uncertainty of movie consumption by consulting movie evaluations (Lim and Kim, 2020); in other words, movie evaluations have a significant influence on consumers’ movie consumption.
Previous research (Lim and Kim, 2020) examined how textual content and language style affect the usefulness of online movie reviews, and found that reviews containing the evaluator’s personal information and objective content are useful, and that narrative and shorter evaluations are helpful.
As content becomes more important, movie evaluations that mention the plot can influence other consumers’ movie consumption decisions.
It is therefore worth studying which movie plots trigger positive or negative reviews from consumers, and identifying the key elements that audiences frequently mention in relation to the plot in their evaluations.
This information can serve as a reference for production teams to improve their plots and potentially induce positive evaluations, which in turn could generate more revenue by encouraging more people to watch movies.
Source: Lim, Sukgyeong (임숙경), and Kim, Jeongeun (김정은).
“텍스트 내용과 언어 스타일이 온라인 영화 리뷰 유용성에 미치는 영향” [The effect of text content and language style on the usefulness of online movie reviews]. 인문사회 21 11.3 (2020): 1575-1589.

I would like to analyze consumer interest in the movie content based on the IMDB Dataset of 50K Movie Reviews, to determine what areas consumers are most interested in when it comes to movie evaluations.
I aim to support the selection of the “plot” factor through data analysis.

The final two bigram network graphs show what people typically mention when referring to the plot.
In negative reviews that mention the plot, people often refer to “scenes”, “minutes”, “poor”, “special effects”, “bottom line”, “sex scenes”, and so on.
In positive reviews, “line”, “story”, and “characters” are frequently noted, along with “angles camera”, “cinema feature”, “real world”, and so on.
Knowing which parts of a movie people focus on while paying attention to the plot can provide direction for filmmaking improvements.

Data background

Explain where the data came from, what agency or company made it, how it is structured, what it shows, etc.

I used the IMDB Dataset of 50K Movie Reviews, a large movie review dataset containing 50,000 movie reviews.
It is designed for binary sentiment classification, with each review labeled as either negative or positive.
The dataset consists of two columns: “review” and “sentiment”.
It includes significantly more data than previous benchmark datasets.
Specifically, it comprises 25,000 highly polarized movie reviews for training and another 25,000 for testing purposes.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.bib

Data loading, cleaning and preprocessing

Describe and show how you cleaned and reshaped the data

Text data analysis

##pre-processing

# Load the data
movie_reviews <- read.csv("movie_reviews.csv")
head(movie_reviews)


# Clean the text. The <br> tags are browser line breaks with no real meaning; they are
# dropped later by adding "br" to the stop-word list. Punctuation is removed and all
# text is converted to lowercase to make later analysis easier.
new_stop_words <- bind_rows(tibble(word = "br", lexicon = "word"),
                            stop_words)
movie_reviews$review <- movie_reviews$review %>%
  str_replace_all("[[:punct:]]", "") %>%
  tolower()
head(movie_reviews)

## Tokenization and stop-word removal
review_tidy <- movie_reviews %>%
  unnest_tokens(input = review,
                output = word,
                drop = FALSE) %>%
  anti_join(new_stop_words, by = "word") %>%
  count(sentiment, word, sort = TRUE)
review_tidy

Reasons for Choosing “Plot” as the Variable

Before exploring what type of movie plots trigger positive or negative consumer reviews, it is crucial to first investigate which factors related to movie content frequently appear in the IMDB Dataset of 50K Movie Reviews.
This exploration includes examining whether the variable “plot” is frequently mentioned in reviews.

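# Use a word cloud to visualize frequently occurring words in the movie reviews.
# Counts are summed over both sentiments so that each word is drawn only once.
review_tidy %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  with(wordcloud(word, n, max.words = 100))
# Select ten high-frequency words related to movie content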

elements_film <-c("plot","story","character","music","actor","director","script","performance","version","acting")

From the word cloud, ten high-frequency words were selected: “plot,” “story,” “character,” “music,” “actor,” “director,” “script,” “performance,” “version,” and “acting.” I calculate the total frequency of these words in positive and negative reviews, and also the frequency of the word “plot” in each sentiment, to get an initial sense of how often the “plot” variable occurs in positive and negative reviews.
Using the log odds ratio to compare the frequencies of these words between the two sentiments, I then analyze which words are more common in positive reviews and which are more common in negative reviews.
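
For reference, the quantity computed in the chunks of this report can be written out as follows. The notation is introduced here for illustration: $n_{w,\text{pos}}$ and $n_{w,\text{neg}}$ denote the counts of word $w$ in positive and negative reviews, and the sums run over all words $v$ in the table. Strictly speaking, this is the log of a ratio of add-one smoothed relative frequencies.

$$
\text{log odds ratio}(w) = \ln \frac{(n_{w,\text{pos}} + 1) \,/\, \sum_{v} (n_{v,\text{pos}} + 1)}{(n_{w,\text{neg}} + 1) \,/\, \sum_{v} (n_{v,\text{neg}} + 1)}
$$

A value above 0 means the word is relatively more frequent in positive reviews, a value below 0 means it is relatively more frequent in negative reviews, and a value near 0 means it is used about equally in both.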

#count the frequency
word_frequency <- review_tidy%>%
  filter(word%in%elements_film)%>%
  group_by(sentiment)%>%
  summarize(total= sum(n))
word_frequency 

plot_frequency <-review_tidy %>%
  filter(word == "plot")%>%
  group_by(sentiment)%>%
  summarize(frequency = sum(n))
plot_frequency

# Calculate the log odds ratio
review_wide <- review_tidy %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = list(n = 0))

# Smoothed share of each word within positive and negative reviews
review_wide <- review_wide %>%
  mutate(ratio_positive = (positive + 1) / sum(positive + 1),
         ratio_negative = (negative + 1) / sum(negative + 1))

# Add the odds ratio variable
review_wide <- review_wide %>%
  mutate(odds_ratio = ratio_positive / ratio_negative)

# Add the log odds ratio variable
review_wide <- review_wide %>%
  mutate(log_odds_ratio = log(odds_ratio))
review_wide

# Extract the log odds ratio of the ten words related to movie content
elements_review_wide <- review_wide %>%
  filter(word %in% elements_film)

# Label each word by the sign of its log odds ratio
lor_review <- elements_review_wide %>%
  mutate(sentiments = ifelse(log_odds_ratio > 0, "positive", "negative")) %>%
  arrange(-log_odds_ratio) %>%
  select(word, log_odds_ratio, sentiments)
lor_review

# Create a bar graph  
ggplot(lor_review, aes(x = reorder(word, log_odds_ratio),
                  y = log_odds_ratio,
                  fill = sentiments)) +
  geom_col(show.legend = F) +
  coord_flip() +
  labs(x = NULL)

Exploring what type of movie plots trigger consumers’ positive or negative movie reviews

Separate the reviews into bigrams and extract the bigrams containing “plot.” Calculate the TF-IDF in positive and negative reviews, and select the top 10 bigrams for each sentiment to identify key bigrams related to the “plot” variable.
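
It may help to spell out what TF-IDF measures in this setup. As computed by bind_tf_idf() in tidytext, the score of a term $t$ in a document $d$ is the term frequency multiplied by the natural log of the inverse document frequency. The symbols below are notation introduced here: $n_{t,d}$ is the count of bigram $t$ in document $d$, and $N$ is the number of documents.

$$
\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \cdot \ln \frac{N}{\lvert \{ d' : t \in d' \} \rvert}
$$

Because the "documents" in this analysis are the two sentiment classes, $N = 2$. The idf term is therefore $\ln 2 \approx 0.693$ for a bigram that occurs in only one class and $\ln 1 = 0$ for a bigram that occurs in both. The top TF-IDF bigrams are thus always bigrams that appear in one sentiment only, ranked by how often they occur there.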

# Tokenize the text into bigrams
bigrams_plot <- movie_reviews %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

# Separating bigrams
review_separated <- bigrams_plot%>%
  separate(bigram, c("word1", "word2"), sep = " ")
review_separated

#removing the stop word
review_separated <-review_separated%>%
  filter(!word1 %in%new_stop_words$word) %>%
  filter(!word2 %in%new_stop_words$word)
review_separated 

# Keep the bigrams in which either word is "plot"
review_plot <- review_separated %>%
  filter(word1 == "plot" | word2 == "plot")

plot_bigram <- review_plot%>%
  unite(bigram, word1, word2, sep = " ")%>%
  count(sentiment, bigram, sort =  T)
plot_bigram

# Compute the TF-IDF of each bigram, treating each sentiment as one document
plot_tf_idf <- plot_bigram %>%
  bind_tf_idf(term = bigram,
              document = sentiment,
              n = n) %>%
  arrange(-tf_idf)

# Select the top 10 bigrams per sentiment
top10_tf_idf <- plot_tf_idf %>%
  group_by(sentiment) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE)

## Create the two graphs
top10_tf_idf$sentiment <- factor(top10_tf_idf$sentiment,
                          levels = c("negative", "positive"))
ggplot(top10_tf_idf, aes(x = reorder_within(bigram, tf_idf, sentiment),
                  y = tf_idf,
                  fill = sentiment)) +
  geom_col(show.legend = F) +
  coord_flip() +
  facet_wrap(~ sentiment, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL)

To explore which types of movie plots trigger positive or negative reviews, I next investigate which bigrams are more important in positive reviews and which are more important in negative reviews.
The log odds ratio is used to explore this further.
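
A small worked example may make the smoothed log odds ratio concrete. The sketch below uses base R only. The counts are hypothetical toy numbers, not values from the IMDB data, and `smoothed_log_ratio()` is a helper name introduced here for illustration; it mirrors the add-one formula used in this report.

```r
# Toy counts (hypothetical numbers, not taken from the IMDB data)
counts <- data.frame(
  bigram   = c("solid plot", "bad plot", "plot twist"),
  positive = c(30, 5, 20),
  negative = c(4, 60, 20)
)

# Log of the ratio of add-one smoothed shares (positive over negative)
smoothed_log_ratio <- function(pos, neg) {
  share_pos <- (pos + 1) / sum(pos + 1)
  share_neg <- (neg + 1) / sum(neg + 1)
  log(share_pos / share_neg)
}

counts$log_odds_ratio <- smoothed_log_ratio(counts$positive, counts$negative)
counts
```

In this toy table, "plot twist" has the same raw count in both classes but still gets a positive score of log(1.5), about 0.405, because the positive column has fewer bigrams in total. The measure compares shares within each class, not raw counts.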

plot_wide <- plot_bigram %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = list(n = 0))

# Calculate the log odds ratio
plot_wide <- plot_wide %>%
  mutate(ratio_positive = (positive + 1) / sum(positive + 1),
         ratio_negative = (negative + 1) / sum(negative + 1),
         odds_ratio = ratio_positive / ratio_negative,
         log_odds_ratio = log(odds_ratio))
plot_wide

# Top 10 bigrams on each side, ranked by absolute log odds ratio
lor_top10 <- plot_wide %>%
  group_by(sentiment = ifelse(log_odds_ratio > 0, "positive", "negative")) %>%
  slice_max(abs(log_odds_ratio), n = 10, with_ties = FALSE) %>%
  arrange(-log_odds_ratio) %>%
  select(bigram, log_odds_ratio, sentiment)
lor_top10

# Create the bar graph
ggplot(lor_top10, aes(x = reorder(bigram, log_odds_ratio),
                  y = log_odds_ratio,
                  fill = sentiment)) +
  geom_col(show.legend = F) +
  coord_flip() +
  labs(x = NULL)

Exploring which additional components viewers typically highlight when discussing the plot

Using n-grams, I focus on the word pairs that appear frequently in sentences containing the word “plot.” This helps to understand the overall relationship between the plot and the reviews.

# Creating a network graph with n-grams
# Note: punctuation was stripped during preprocessing, so unnest_sentences() may
# return each review as a single "sentence" here.
plot_sentences <- movie_reviews %>%
  unnest_sentences(input = review,
                   output = sentences,
                   drop = FALSE) %>%
  filter(str_detect(sentences, "\\bplot\\b"))

# Split the plot sentences by sentiment
plot_negative <- plot_sentences %>%
  filter(sentiment == "negative")
plot_positive <- plot_sentences %>%
  filter(sentiment == "positive")

##tokenizing text into bigrams 
bigrams_negative <- plot_negative%>%
  unnest_tokens(bigram, 
                sentences, 
                token = "ngrams",
                n = 2)
bigrams_negative

## Separating bigrams
negative_separated <- bigrams_negative %>%
  separate(bigram, c("word1", "word2"), sep = " ")
negative_separated

## Find word pair frequencies
negative_counts <- negative_separated%>%
  filter(!word1 %in%new_stop_words$word) %>%
  filter(!word2 %in%new_stop_words$word)%>%
  count( word1, word2, sort = TRUE)%>%
  na.omit()
negative_counts

##Creating a Network Graph with N-grams(negative)

library(tidygraph)

negative_bigram <- negative_counts%>%
  filter(n >= 50) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree(),  
         group = as.factor(group_infomap())) 
negative_bigram
library(ggraph)
set.seed(1234)
ggraph(negative_bigram, layout = "fr") +
  geom_edge_link(color = "gray50",             
                 alpha = 0.5) +                
  geom_node_point(aes(size = centrality,       
                      color = group),          
                  show.legend = F) +           
  scale_size(range = c(5, 10)) +               
  geom_node_text(aes(label = name),
                 repel = T,
                 size = 5) +
  labs(title = "negative_plot") +
  theme_graph()

###Creating a Network Graph with N-grams(positive)
bigrams_positive <-plot_positive%>%
  unnest_tokens(bigram, 
                sentences, 
                token = "ngrams",
                n = 2)
bigrams_positive

positive_separated <- bigrams_positive%>%
  separate(bigram, c("word1", "word2"), sep = " ")
positive_separated


positive_counts <- positive_separated%>%
  filter(!word1 %in%new_stop_words$word) %>%
  filter(!word2 %in%new_stop_words$word)%>%
  count( word1, word2, sort = TRUE)%>%
  na.omit()
positive_counts

library(tidygraph)

positive_bigram <- positive_counts%>%
  filter(n >= 35) %>%
  as_tbl_graph(directed = F) %>%
  mutate(centrality = centrality_degree(),  
         group = as.factor(group_infomap())) 
positive_bigram
library(ggraph)
set.seed(1234)
ggraph(positive_bigram, layout = "fr") +
  geom_edge_link(color = "gray50",             
                 alpha = 0.5) +                
  geom_node_point(aes(size = centrality,       
                      color = group),          
                  show.legend = F) +           
  scale_size(range = c(5, 10)) +               
  geom_node_text(aes(label = name),
                 repel = T,
                 size = 5) +
  labs(title = "positive_plot") +
  theme_graph()

Individual analysis and figures

Analysis and Figure 1

Describe and show how you created the first figure.
Why did you choose this figure type?

After tokenization and stop-word removal, I obtained a tidy review dataset with the columns “sentiment”, “word”, and “n”.
Using wordcloud() with this dataset, I visualized the 100 most common words in the movie reviews.
In a word cloud, words are displayed in font sizes that vary with their frequency in the text, with higher-frequency words appearing larger.
As expected, “movie” and “film” are the most frequent words in the reviews and are displayed in the largest fonts.
Other frequently appearing words related to movie content include “plot”, “story”, “character”, “music”, and so on.
I chose the word cloud for two reasons. First, I wanted to understand which words related to movie content appear most frequently in the reviews.
A word cloud presents word frequency in a visually intuitive manner, since font size reflects how often each word appears in the text.
This allows us to quickly identify the more frequent words.
Second, I am more interested in understanding which aspects of movie content receive attention than in knowing the exact frequency of each word.
Given the size of the dataset, a word cloud can clearly display the high-frequency words related to movie content, which prepared me to select variables.

Analysis and Figure 2

In this chart, the words “script” and “plot” appear relatively more often in negative reviews than in positive ones, while “performance” and “music” appear relatively more often in positive reviews.
The main variable of interest in this study, “plot,” leans toward negative reviews, and it ranks second among the ten words in this respect.
“Plot” is more important in negative reviews than in positive ones, which suggests that the “plot” variable may have a significant impact on eliciting negative consumer reviews.
This finding provides a foundation for further exploration and supports the feasibility of “plot” as a variable in this study.

The log odds ratio lets us quantify and compare which of these ten words tend to appear more in positive or in negative reviews, and set aside words that appear with similar frequencies in both. This supports the feasibility of “plot” as a variable and allows us to assess its importance in both positive and negative reviews.

Analysis and Figure 3

In the log odds ratio chart, “coherent plot,” “bad plot,” and “nonexistent plot” are more prominent on the negative side, whereas “solid plot,” “entertaining plot,” and “amazing plot” are more prominent on the positive side.

The puzzling aspect here is that a coherent plot typically enhances the viewing experience, so one would expect “coherent plot” to appear more frequently in positive reviews.
In this chart, however, “coherent plot” is more emphasized in negative evaluations.
This could be due to the limitations of bigram analysis.
For instance, the phrase “coherent plot” might be negated (as in “no coherent plot”) or used in an unfavorable comparison with plots in other movies, which would place it in negative contexts.
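
One way to probe the negation explanation would be to check how often “coherent plot” is preceded by a negating word. The following is a minimal base R sketch under stated assumptions: the sentences are invented for illustration (they are not from the IMDB data), and the `negators` list is a hypothetical, deliberately short word list.

```r
# Toy sentences (hypothetical, written for illustration only)
sentences <- c(
  "there is no coherent plot in this film",
  "the film lacks a coherent plot",
  "a coherent plot and great acting",
  "not a coherent plot at all"
)

# Hypothetical list of negating words; a real analysis would need a fuller list
negators <- c("no", "not", "never", "without", "lacks", "lacking")

# A negator, then at most two other words, then "coherent plot"
pattern <- paste0("\\b(", paste(negators, collapse = "|"),
                  ")\\b(\\s+\\w+){0,2}\\s+coherent plot\\b")

has_phrase <- grepl("\\bcoherent plot\\b", sentences, perl = TRUE)
is_negated <- grepl(pattern, sentences, perl = TRUE)

# Share of "coherent plot" mentions that are preceded by a negator
mean(is_negated[has_phrase])
```

Applied to the real reviews, a high share of negated mentions in the negative class would support the interpretation above. A low share would point to other causes, such as unfavorable comparisons.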

TF-IDF highlights terms that are uncommon overall but frequently used in a specific text.
“Incoherent plot” is not common overall but is frequently used in negative reviews.

Another noteworthy term in the negative category is “gaping plot”.
“Unbelievable plot” indicates that viewers find the plot developments in the movie unrealistic or implausible.
Most negative reviews focus on criticisms of plot coherence, reasonableness, and believability, often using terms like “bad plot,” “terrible plot,” or “stupid plot.” The log odds ratio shows that positive reviews mainly emphasize interesting, creative, and engaging aspects of the plot.
Audiences particularly appreciate solid, clever, brilliant, and unusual plot designs in movies.
The log odds ratio chart also shows that “bad plot,” “nonexistent plot,” and “amazing plot” reflect audience sentiment.
However, terms like “bad” and “amazing” are abstract and subjective, and vary from person to person.
I chose bigrams instead of single words because my research question centers on the plot, specifically “What type of movie plots trigger consumers’ positive or negative movie reviews?” Bigrams containing “plot” address this question more effectively than individual words. Through the log odds ratio, I can compare the relative frequencies of these bigrams in positive and negative reviews and thereby observe the numerical differences between plot types.

Based on the log odds ratio, one can infer that “gaping plot” and “unbelievable plot” may trigger negative reviews, while “solid plot” and “unusual plot” may lead to positive reviews.

From TF-IDF, it can be inferred that “incoherent plot” appears frequently in negative reviews, which influences the evaluations.

Analysis and Figure 4

Finally, I chose network graphs of bigrams to see the overall relationships between words in sentences containing “plot”. In negative reviews, common mentions such as “scenes”, “minutes”, “poor”, “special effects”, “bottom line”, and “sex scenes” suggest audience dissatisfaction with or criticism of plot development, special effects quality, and overall film quality.

Conversely, frequent mentions in positive reviews include “line”, “story”, “characters”, “angles camera”, “cinema feature”, and “real world,” which likely reflect audience appreciation for plot clues, story content, character performances, and the technical aspects and real-world relevance of the film.

These insights can help filmmakers, critics, and researchers better understand the key factors influencing audience reviews.
Sentences containing the word “plot” are more frequent in the negative category than in the positive category.
Therefore, to better visualize the overall word relationships in negative and positive reviews, I chose n >= 50 for negative and n >= 35 for positive.
This adjustment increased the number of words shown for the positive category.
If both thresholds were n >= 50, the positive graph would contain relatively few words, which would look sparse and would not help much in understanding the overall word relationships.

#In showing the figures that you created, describe why you designed it the way you did.
Why did you choose those colors, fonts, and other design elements?
Does it convey truth?
At first, I identified ten high-frequency words through the word cloud: “plot,” “story,” “character,” “music,” “actor,” “director,” “script,” “performance,” “version,” and “acting.” Using the frequency of the word “plot” in positive and negative reviews, and comparing these frequencies with the log odds ratio, I aimed to demonstrate why I selected “plot” as a variable.
I chose bigrams over single words when exploring what type of movie plots trigger consumers’ positive or negative movie reviews because my research question focuses on which types of movie plots elicit these reactions.
Analyzing individual words alone cannot adequately address this question.
Therefore, I analyzed bigrams containing the word “plot” using TF-IDF and the log odds ratio.
Movie reviews may also mention specific movie titles, character names, and other details, which limits how much single-word analysis can reveal about movie evaluations.
This is another reason why I selected bigrams.
With TF-IDF, I aimed to identify bigrams that are distinctive of positive or negative reviews, while the log odds ratio allowed me to further analyze the relative importance of bigrams in both types of reviews.
Using TF-IDF, I inferred which types of movie plots are significant and which may trigger consumers’ positive or negative movie reviews.
Finally, I wanted to explore which additional components viewers typically highlight when discussing the plot.

The TF-IDF and log odds ratio charts provide useful insights.
For example, terms like “gaping plot” and “unbelievable plot” appear more frequently in negative reviews than in positive ones, so such plot problems may prompt negative reviews.
This information can guide future film productions to scrutinize these elements of the plot in order to minimize negative reviews.
However, terms like “bad plot” and “terrible plot” are abstract and subjective, and viewers interpret them differently.
For improving film production, this ambiguity can lead to misunderstanding or uncertainty. TF-IDF and the log odds ratio help identify distinctive bigrams in positive and negative reviews and their relative importance.
While they suggest possible associations, they do not conclusively determine which plots lead to positive or negative reviews.
It remains difficult to determine directly which plots trigger positive evaluations and which trigger negative ones, and research on the relationship between the two variables is lacking.

Finally, analyzing bigrams reveals direct word relationships in texts, such as the meaning of a phrase like “bottom line” compared to the single word “bottom.” In the network graphs, I used color = group to set each node’s color by its group, set the node size range to 5 to 10, and chose gray50 as the edge color so that the edges do not interfere with the colors of the nodes and words.
This makes the bigram network graphs clearer. However, these charts may lack clarity in word clustering and often show nodes with numerous connections.
