
Executive summary

What is (are) your main question(s)? What is your story? What does the final graphic show?

This question matters because Barbie was not just a commercial success — it became a cultural phenomenon. With Greta Gerwig at the helm, the film mixed humor, fantasy, and social commentary, especially on gender dynamics and identity. These layers sparked diverse, and at times polarized, reactions from viewers worldwide. By analyzing IMDb reviews, this study aims to uncover which themes resonated most with audiences, what sentiments dominated user reactions, and whether there is evidence of ideological or emotional divides in audience perception.

The story behind this project is the transformation of Barbie from a plastic icon into a mirror of societal values and controversies. Through text mining and visualization techniques (a word frequency plot, a sentiment bar chart, and a word cloud), the final graphics reveal the emotional landscape and thematic highlights embedded in the public’s response. They show not only what people said but also how they felt, offering insight into Barbie’s role in shaping (or challenging) cultural conversations.

Data background

Explain where the data came from, what agency or company made it, how it is structured, what it shows, etc.

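The dataset is a CSV file of IMDb user reviews of Barbie (2023) with two columns: text, which holds the review, and rating, which holds the user’s score. This structured format makes the dataset well-suited for natural language processing and sentiment analysis. Although the file name indicates that the reviews were cleaned in advance, the text column still contains mixed case, punctuation, and what appear to be the review title, username, and date run together with the review body, so tokenization and stop word filtering are still needed.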

By analyzing this dataset, we aim to explore how viewers emotionally and thematically responded to Barbie (2023), as reflected in the language they used. With 790 reviews remaining after filtering for valid ratings, this dataset provides a valuable lens through which to examine the film’s cultural and social impact.

Data loading, cleaning and preprocessing

Describe and show how you cleaned and reshaped the data

I first cleaned the data by filtering out any rows where the rating was not a whole number, so that only valid ratings were included. I then focused on the text column and broke the reviews down into individual words using tokenization, which also lowercases the words and strips punctuation. After that, I removed common stop words like “the” and “is” that don’t add much meaning to the analysis. I also filtered out any numbers and words shorter than three letters to keep the data focused and relevant. These steps reshaped the data into a clean list of words that I could use for further analysis, such as word frequency and sentiment.

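# Load the packages used throughout the analysis
library(tidyverse)     # read_csv, dplyr verbs, stringr, ggplot2
library(tidytext)      # unnest_tokens, stop_words, get_sentiments
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal

barbie <- read_csv("barbie_Cleaned.csv", show_col_types = FALSE)

# Keep only rows whose rating consists entirely of digits
barbie_clean <- barbie %>%
  filter(str_detect(rating, "^[0-9]+$"))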

glimpse(barbie_clean)
## Rows: 790
## Columns: 2
## $ text   <chr> "Beautiful film, but so preachyLoveofLegacy21 July 2023Margot d…
## $ rating <chr> "6", "6", "8", "9", "7", "8", "6", "8", "6", "8", "6", "4", "1"…
# Split each review into one lowercase word per row
barbie_words <- barbie_clean %>%
  select(text) %>%
  unnest_tokens(word, text)

# Remove stop words, pure numbers, and words shorter than three letters
data("stop_words")
barbie_words_clean <- barbie_words %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^[0-9]+$")) %>%
  filter(str_length(word) > 2)
barbie_words_clean
## # A tibble: 55,580 × 1
##    word                 
##    <chr>                
##  1 beautiful            
##  2 film                 
##  3 preachyloveoflegacy21
##  4 july                 
##  5 2023margot           
##  6 film                 
##  7 disappointing        
##  8 marketed             
##  9 fun                  
## 10 quirky               
## # ℹ 55,570 more rows
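
The tokens preachyloveoflegacy21 and 2023margot in the output above suggest that separate fields (apparently the review title, username, date, and body) were run together without spaces in the text column. A minimal sketch of one way to separate them before tokenizing is shown below. It assumes the fused boundaries fall between a lowercase and an uppercase letter or between a digit and a letter. The helper name split_fused is hypothetical and is not part of the original analysis, and the rule would also split legitimate strings such as "2nd" or "iPhone", so the result should be checked by eye.

```r
# Hypothetical helper (not part of the original analysis): insert a space
# at boundaries where separate fields appear to have been run together.
split_fused <- function(x) {
  # lowercase letter followed by an uppercase letter, e.g. "preachyLove"
  x <- gsub("([a-z])([A-Z])", "\\1 \\2", x)
  # digit followed by a letter, e.g. "2023Margot"
  x <- gsub("([0-9])([A-Za-z])", "\\1 \\2", x)
  x
}

# The visible start of the first review in the glimpse() output
example <- "Beautiful film, but so preachyLoveofLegacy21 July 2023Margot"
split_fused(example)
# "Beautiful film, but so preachy Loveof Legacy21 July 2023 Margot"
```

If this approach suits the data, it could be applied with mutate(text = split_fused(text)) before the unnest_tokens() step.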

Text data analysis

# 1. Most Frequent Words
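# Count how often each word appears, most frequent first
top_words <- barbie_words_clean %>%
  count(word, sort = TRUE)
top_words %>%
  # with_ties = FALSE keeps exactly 20 rows even if counts are tied
  slice_max(n, n = 20, with_ties = FALSE) %>%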
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "hotpink") +
  coord_flip() +
  labs(title = "Top 20 Most Common Words in Barbie Reviews",
       x = "Word",
       y = "Frequency") +
  theme_minimal(base_size = 14)

# 2. Sentiment Analysis using Bing Lexicon
bing <- get_sentiments("bing")

barbie_sentiment <- barbie_words_clean %>%
  inner_join(bing, by = "word") %>%
  count(sentiment, sort = TRUE)

barbie_sentiment %>%
  ggplot(aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col() +
  labs(title = "Overall Sentiment in Barbie Reviews",
       x = "Sentiment",
       y = "Word Count") +
  scale_fill_manual(values = c("positive" = "hotpink", "negative" = "black")) +
  theme_minimal()

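# 3. Word Cloud
set.seed(123)  # word placement is random; a fixed seed makes the layout reproducible
barbie_words_clean %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(
    words = word, 
    freq = n, 
    max.words = 100, 
    colors = brewer.pal(8, "Set1"), 
    scale = c(2, 0.5)
  ))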

Individual analysis and figures

Analysis and Figure 1

Describe and show how you created the first figure. Why did you choose this figure type?

  • I created a bar chart titled “Top 20 Most Common Words in Barbie Reviews” to show the words that appeared most frequently across the reviews. I worked on this in RStudio, where I first cleaned the text data by removing punctuation, common stop words, numbers, and very short words so that the results would focus only on meaningful words. After preparing the text, I broke it down into individual words and counted how often each word appeared in the dataset.

Once I had the word frequencies, I selected the top 20 most common words and used them to build the chart. I chose a horizontal bar chart because I felt it would be the clearest way to show the results. It’s much easier to read the word labels when they are listed vertically on the y-axis, especially when some of the words are longer. This type of figure also makes it really simple to compare the word frequencies at a glance, which is useful when the goal is to quickly spot which themes or names were most frequently mentioned in the reviews. Overall, this visual helps summarize a large amount of text data in a straightforward way that’s easy to understand, which is why I thought it was the best fit for this analysis.

Analysis and Figure 2

  • For this figure, I created a bar chart that shows the overall sentiment in Barbie reviews, broken down into positive and negative categories. I did this by performing a sentiment analysis, which labels words in the reviews as either positive or negative based on a sentiment dictionary (the Bing lexicon). After processing the text, I counted how many positive and negative words appeared in total across all the reviews.

I decided to use a simple vertical bar chart for this figure because it clearly shows the contrast between the number of positive and negative words. The visual makes it easy to see that positive words appeared much more frequently than negative ones, which gives a quick overview of the general mood people had towards the Barbie movie. I also chose to color the bars differently—using black for negative and pink for positive—to make the comparison more visually striking and to tie in the pink color that is popularly associated with the Barbie brand. Overall, this type of chart was the best way to display the sentiment results in a clear, direct, and easily understandable way.
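
To put a number on the contrast between the two bars, the counts can be converted to shares. The sketch below assumes a table shaped like barbie_sentiment (one row per sentiment with a count column n). The counts in it are made-up placeholders for illustration, not results from the Barbie reviews, and the names sentiment_counts and sentiment_share are hypothetical.

```r
library(dplyr)

# Placeholder counts for illustration only; in the analysis above this
# table would be barbie_sentiment.
sentiment_counts <- tibble(
  sentiment = c("positive", "negative"),
  n = c(60, 40)
)

# Share of sentiment-bearing words that fall in each category
sentiment_share <- sentiment_counts %>%
  mutate(share = n / sum(n))

sentiment_share
```

Note that the shares only describe words that appear in the lexicon; words without a sentiment label are dropped by the inner join and do not count toward either bar.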

Analysis and Figure 3

In showing the figures that you created, describe why you designed it the way you did. Why did you choose those colors, fonts, and other design elements? Does it convey truth?

  • For this figure, I created a word cloud to visually display the most common words used in the Barbie movie reviews. I used RStudio to process the text by cleaning it (removing stop words, punctuation, and unnecessary spaces) and then calculated word frequencies. I chose a word cloud because it’s a fun, creative way to present text data, and it quickly shows which words stand out in people’s reviews. Words that appear more frequently are shown in larger fonts, which makes it easy for viewers to see immediately which themes or names were most common.

I also used a simple, rounded font to keep the design playful and approachable, which fits the vibe of the movie and its audience. I didn’t want to use harsh colors or serious fonts because that would not have matched the cultural tone surrounding Barbie.

In terms of whether the figure conveys truth, I think word clouds can be a little limited because they only show frequency, not context or whether a word was used in a positive or negative way. However, for the purpose of giving a general overview of which words appeared the most, it’s a truthful and effective summary. It helps viewers quickly spot key themes and names like “Ken,” “Ryan,” and “Robbie.” It also surfaces “July,” although this word most likely comes from the review dates embedded in the text column (for example, “21 July 2023”) rather than from reviewers discussing the release timing.

Overall, I chose this figure type because it’s visually engaging, easy to interpret, and it matched both the tone of the movie and the overall style I wanted to communicate.
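
Because a word cloud shows frequency without sentiment, one possible complement is to tabulate the most frequent words within each sentiment category. The sketch below is only an illustration under stated assumptions: toy_words and toy_lexicon are made-up stand-ins for barbie_words_clean and get_sentiments("bing"), and top_by_sentiment is a hypothetical name.

```r
library(dplyr)

# Toy tokens and a toy lexicon, for illustration only. In the analysis above
# these would be barbie_words_clean and get_sentiments("bing").
toy_words <- tibble(
  word = c("fun", "fun", "beautiful", "boring", "boring", "boring", "film")
)
toy_lexicon <- tibble(
  word = c("fun", "beautiful", "boring"),
  sentiment = c("positive", "positive", "negative")
)

top_by_sentiment <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%   # drops words absent from the lexicon
  count(sentiment, word, sort = TRUE) %>%    # frequency of each word per sentiment
  group_by(sentiment) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>% # most frequent word in each category
  ungroup()

top_by_sentiment
```

With real data, raising n in slice_max() to 10 or so would give a short list per category that could be plotted as two small bar charts next to the word cloud.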