Movie_summaries_analysis_assignment8

Comparing the IMDb summaries of Zootopia and The Conjuring using Tidy text scraping

For this assignment, we were tasked with comparing texts on two opposite subjects. I first considered unstructured IMDb data, since I knew the top IMDb fan summaries would be easy to collect by scraping the relevant HTML elements and piping the result into a text-extraction command. I then decided that two very different movies would provide the most interesting sentiment words to compare using the Bing and NRC lexicons. I chose Zootopia as the movie I expected to have the most positively scored words in its summaries, and The Conjuring as its opposite. That led to my line of inquiry: by the standard of the Bing and NRC lexicons, will Zootopia’s summaries show a more upbeat sentiment than The Conjuring’s because it is a kids’ movie, or will both movies show similar sentiment because they both contain dark cinematic themes?

Step 1: Importing the data after scraping it into a script file using HTML element scraping and tokenizing the reviews of each movie by word.
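
A minimal sketch of what this step could look like, assuming the rvest and tidytext packages. The inline HTML, the `.summary-text` selector, the sample sentences, and the names `demo_page`, `demo_summaries`, and `tidy_demo` are illustrative stand-ins, not the actual IMDb markup or the objects used in this analysis.

```r
library(rvest)     # read_html(), html_elements(), html_text2()
library(dplyr)
library(tibble)
library(tidytext)  # unnest_tokens(), get_sentiments()

# Stand-in for a downloaded IMDb page; the markup and class name are invented.
demo_page <- read_html('
  <html><body>
    <div class="summary-text">A rabbit cop must trust a sly fox to solve a strange conspiracy.</div>
    <div class="summary-text">Fear and discrimination threaten to divide the city.</div>
  </body></html>')

# Pull the text out of every element matching the (hypothetical) selector.
demo_summaries <- demo_page %>%
  html_elements(".summary-text") %>%
  html_text2()

# One row per word, with a movie label for later grouping and faceting.
tidy_demo <- tibble(movie = "Zootopia", text = demo_summaries) %>%
  unnest_tokens(word, text)

# The Bing lexicon ships with tidytext.
bing <- get_sentiments("bing")
# nrc <- get_sentiments("nrc")  # needs the textdata package; downloads on first use
```

For a real page, `read_html()` would take the page URL instead of an inline string, and the selector would have to match whatever element actually wraps each summary.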

Step 2: Analyzing the movie summaries by sentiments and visualizing sentiment counts (Bing)

# Count each word, then keep only the words scored by the Bing lexicon
zootopia_counts <-
  tidy_zoo %>% 
  group_by(word) %>%
  summarise(n = n()) %>% 
  inner_join(bing, by = "word")

conjuring_counts <-
  tidy_con %>% 
  group_by(word) %>%
  summarise(n = n()) %>% 
  inner_join(bing, by = "word")

Visualizing sentiment counts for The Conjuring:

conjuring_counts %>%
  filter(n > 1) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette="Set1") +
  labs(title = "Conjuring summaries sentiment scores by word",
       subtitle = "Scorable words appearing more than once",
       x = "Word",
       y = "Contribution to sentiment")

Visualizing sentiment counts for Zootopia:

zootopia_counts %>%
  filter(n > 1) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette="Set1") +
  labs(title = "Zootopia summaries sentiment scores by word",
       subtitle = "Scorable words appearing more than once",
       x = "Word",
       y = "Contribution to sentiment")

Bing Valence analysis for both movies using a combined data frame:

movies <-
  bind_rows(tidy_con,tidy_zoo)



### bing valence analysis

movies %>%  
  inner_join(bing, by = "word") %>% 
  mutate(index = row_number()) %>%        
  group_by(movie, index, sentiment) %>%   
  summarize(n = n(), .groups = "drop") %>% 
  tidyr::spread(sentiment, n, fill = 0) %>%  
  mutate(valence = positive - negative,
         color = ifelse(valence >= 0, "positive", "negative")) %>% 
  ggplot(aes(x = index, y = valence, fill = color)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("positive" = "green", "negative" = "red")) +
  facet_wrap(~ movie, ncol = 2, scales = "free_x") +
  labs(
    title = "Chronological Valence Analysis",
    subtitle = "For IMDb summaries of Zootopia and The Conjuring",
    x = "Relative position in text",
    y = "Valence (positivity - negativity)"
  )
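
In the chunk above, `row_number()` runs over the combined data frame, so the Zootopia positions continue where The Conjuring positions end. If positions that restart at 1 for each movie are preferred, one option is to number the words within each movie. This is a sketch on invented toy data; `toy_scored` and `toy_valence` are hypothetical names, and the real input would be `movies %>% inner_join(bing, by = "word")`.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the words left after joining with the Bing lexicon.
toy_scored <- tribble(
  ~movie,          ~word,     ~sentiment,
  "The Conjuring", "evil",    "negative",
  "The Conjuring", "demonic", "negative",
  "Zootopia",      "trust",   "positive",
  "Zootopia",      "fear",    "negative",
  "Zootopia",      "protect", "positive"
)

toy_valence <- toy_scored %>%
  group_by(movie) %>%
  mutate(index = row_number()) %>%  # restart the counter for each movie
  ungroup() %>%
  # One scored word per row, so valence is +1 or -1
  mutate(valence = ifelse(sentiment == "positive", 1, -1))
```

With per-movie positions, the two facets share a comparable x-axis starting point, which makes it easier to compare where in the summaries the positive words fall.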

Step 3: Analyzing summary sentiment words using the NRC lexicon

conjuring_sentiment <- 
  tidy_con %>% 
  inner_join(nrc, by = "word", relationship = "many-to-many") %>% 
  group_by(sentiment) %>%
  summarize(`Count`=n(),
            `Percent of scoreable words` = `Count`/nrow(.),
            `movie`= "The Conjuring") %>% 
  arrange(-`Percent of scoreable words`)

zootopia_sentiment <- 
  tidy_zoo %>% 
  inner_join(nrc, by = "word", relationship = "many-to-many") %>% 
  group_by(sentiment) %>%
  summarize(`Count`=n(),
            `Percent of scoreable words` = `Count`/nrow(.),
            `movie`= "Zootopia") %>% 
  arrange(-`Percent of scoreable words`)


all_movie_sentiments <-
  bind_rows(conjuring_sentiment, zootopia_sentiment)



all_movie_sentiments %>% 
  ggplot(aes(x=sentiment, y = `Percent of scoreable words`, fill=movie))+
  geom_col(position = "dodge") +
  scale_fill_brewer(palette="Set1") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "A comparison of the emotive sentiments found in Zootopia and The Conjuring summaries",
       subtitle = "Using the NRC Lexicon (Mohammad and Turney, 2013), shown as a percent of scorable words",
       x = "Sentiment",
       fill = "Movie") +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) 

Step 4: Inquiry Response and Visualization Analysis

To my surprise, the IMDb summaries for both Zootopia and The Conjuring shared a pattern of words with rather “negative” connotations. In the Bing sentiment word count visualization for The Conjuring, 7 scoreable Bing words appeared more than once in the summaries: “struggling”, “malevolent”, “dilapidated”, “demonic”, “slowly”, “evil”, and “demon”. The same visualization for Zootopia showed 17 scoreable words, but only three of them had a positive sentiment score. The 17 words were “savage”, “shrew”, “attacks”, “bully”, “confession”, “conspiracy”, “disappointed”, “discrimination”, “fear”, “illegal”, “protest”, “refuses”, “resignation”, “strange”, “oasis”, “protect”, and “trust”. Zootopia’s scoreable word count was likely larger because its summaries were longer and more detailed. Most of the negatively scored words in both films sat around the -2 mark, meaning they appeared about twice.

Next, I ran a Bing valence analysis on a combined data frame containing both films’ tokenized summaries. This shows the positive or negative score at a given position in the text, where valence is the count of positive words minus the count of negative words. Because each scoreable word is plotted on its own, the graph shows -1 for a negative word and +1 for a positive one. Both sets of summaries had a similar ratio of negative to positive valences. However, in The Conjuring, positive emotions were spread more evenly throughout the summaries, whereas in Zootopia they were highly concentrated at the beginning.
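
The ratio of negative to positive words can also be checked numerically instead of being read off the plot. The following is a sketch on invented toy data, assuming dplyr and tidyr; `toy_scored`, `toy_ratio`, and `share_negative` are hypothetical names, and the numbers are not results from this analysis.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for the Bing-scored words of both movies.
toy_scored <- tribble(
  ~movie,          ~word,     ~sentiment,
  "The Conjuring", "evil",    "negative",
  "The Conjuring", "demonic", "negative",
  "Zootopia",      "trust",   "positive",
  "Zootopia",      "fear",    "negative",
  "Zootopia",      "protect", "positive"
)

toy_ratio <- toy_scored %>%
  count(movie, sentiment) %>%                   # words per movie and polarity
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%              # a missing polarity becomes 0
  mutate(share_negative = negative / (negative + positive))
```

Running the same pipeline on the real joined data would give one row per movie, so the two negative shares can be compared directly.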

For the last visualization, I used a combined data frame of both summaries’ scoreable NRC word categories. The categories that the NRC lexicon identified in both groups were anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, and trust. Surprisingly, Zootopia had a higher percentage of negatively scored words, as well as higher percentages in the anger and fear categories. Even with a lower total word count, The Conjuring had a higher percentage of scoreable words in the joy category.
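
Comparing percentages, not raw counts, is what lets a shorter set of summaries come out ahead in a category. A small sketch of that normalization on invented toy data, assuming dplyr; `toy_nrc` and `toy_shares` are hypothetical names, and the values are not taken from the real summaries.

```r
library(dplyr)
library(tibble)

# Toy stand-in for NRC-scored words: one row per (word, category) match.
toy_nrc <- tribble(
  ~movie,          ~sentiment,
  "The Conjuring", "fear",
  "The Conjuring", "joy",
  "Zootopia",      "fear",
  "Zootopia",      "fear",
  "Zootopia",      "joy",
  "Zootopia",      "trust"
)

toy_shares <- toy_nrc %>%
  count(movie, sentiment) %>%
  group_by(movie) %>%
  mutate(share = n / sum(n)) %>%  # share of that movie's scored words
  ungroup()

# Both movies have one "joy" word, but the shares differ.
joy_shares <- toy_shares %>% filter(sentiment == "joy")
```

In this toy case each movie has exactly one joy word, yet the movie with fewer scored words ends up with the larger joy share.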

In conclusion, the Zootopia and Conjuring IMDb fan summaries produced a very different sentiment outcome from the one I had predicted. The Zootopia summaries were much more emotionally rich than I had anticipated, incorporating many words that point to a more intellectually stimulating film. This analysis supports the idea that kids’ movies can be more than just happy, positive pieces of media, and that horror movies can be balanced out by moments of relief and joy.