Executive summary

If you read just a single sentence of a movie review, could you guess whether it’s a positive or negative one? We explore that question by analyzing 50,000 reviews and digging into the emotional vocabulary that makes a review glow or groan.

This project analyzes the IMDB movie review dataset to explore the emotional tone and vocabulary differences between positive and negative reviews. Using tools such as tidytext and the Bing sentiment lexicon, we tokenize the text, remove stop words, and conduct word frequency and sentiment analysis.

Through a series of visualizations, we uncover distinct patterns in how users express satisfaction or disappointment. The final charts show that positive reviews tend to use upbeat words like “love” and “fun”, while negative ones are saturated with words such as “bad” and “boring”. These findings provide insight into how emotions are linguistically manifested in online user-generated content.

Data background

The dataset used in this project is the IMDB Dataset of 50K Movie Reviews, which is publicly available on platforms like Kaggle. Originally curated for benchmarking sentiment classification models, it has become a staple in natural language processing (NLP) research.

The dataset contains 50,000 English-language movie reviews sourced from IMDb, evenly split between positive and negative sentiments, with no neutral entries. Each record includes a review text (review) and a sentiment label (sentiment), which is either “positive” or “negative”.

The reviews vary in length, tone, and vocabulary, making this dataset ideal for exploring emotional expression, lexical choices, and the alignment between language and sentiment.

Data loading, cleaning and preprocessing

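library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()

imdb <- read_csv("IMDB Dataset.csv")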

## Rows: 50000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): review, sentiment
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

imdb <- imdb %>% mutate(id = row_number())

head(imdb)

## # A tibble: 6 × 3
##   review                                                         sentiment    id
##   <chr>                                                          <chr>     <int>
## 1 "One of the other reviewers has mentioned that after watching… positive      1
## 2 "A wonderful little production. <br /><br />The filming techn… positive      2
## 3 "I thought this was a wonderful way to spend time on a too ho… positive      3
## 4 "Basically there's a family where a little boy (Jake) thinks … negative      4
## 5 "Petter Mattei's \"Love in the Time of Money\" is a visually … positive      5
## 6 "Probably my all-time favorite movie, a story of selflessness… positive      6
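The even split between positive and negative labels described in the data background can be checked with a simple count. The sketch below is illustrative only: `toy_imdb` is a hypothetical four-row stand-in with the same two columns, not the real data, and the same pipeline would be applied to `imdb`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the real data (same columns, made-up rows)
toy_imdb <- tibble(
  review    = c("Loved it", "Terrible plot", "A lovely cast", "So boring"),
  sentiment = c("positive", "negative", "positive", "negative")
)

# Count reviews per label and convert the counts to shares
class_balance <- toy_imdb %>%
  count(sentiment) %>%
  mutate(share = n / sum(n))

print(class_balance)
```

On a balanced dataset, each label's share should come out at 0.5.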

data("stop_words")

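tidy_imdb <- imdb %>%
  # Strip HTML line breaks ("<br />") so "br" is not counted as a word
  mutate(review = gsub("<br ?/?>", " ", review)) %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word")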

head(tidy_imdb)

## # A tibble: 6 × 3
##   sentiment    id word     
##   <chr>     <int> <chr>    
## 1 positive      1 reviewers
## 2 positive      1 mentioned
## 3 positive      1 watching 
## 4 positive      1 1        
## 5 positive      1 oz       
## 6 positive      1 episode

Text data analysis

We applied the tidytext framework to tokenize the reviews, remove standard stop words, and perform word-level analysis. Using the Bing sentiment lexicon, we classified and quantified emotional words, allowing us to compare sentiment-aligned vocabulary across positive and negative reviews.
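To make the tokenization and stop-word steps concrete, here is a minimal sketch on two made-up sentences. `toy_reviews` is a hypothetical example, not part of the dataset; the functions are the same ones used in the preprocessing code.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two hypothetical reviews, one per row
toy_reviews <- tibble(
  id     = 1:2,
  review = c("This movie was fun and I love it",
             "The plot was bad and the acting was boring")
)

toy_tokens <- toy_reviews %>%
  unnest_tokens(word, review) %>%      # one lowercased word per row
  anti_join(stop_words, by = "word")   # drop words listed in stop_words

print(toy_tokens)
```

Each remaining row keeps its `id`, so a word can always be traced back to the review it came from.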

Individual analysis and figures

Analysis and Figure 1

The first figure shows the 10 most frequent words in positive and negative reviews, respectively.

We chose a bar chart because it provides a clear, side-by-side comparison of word frequency across categories. Bar charts are effective for categorical comparisons and are easy to interpret visually.

word_freq <- tidy_imdb %>%
  count(sentiment, word, sort = TRUE)

top_words <- word_freq %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, n, sentiment))

ggplot(top_words, aes(x = word, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Top Words in Positive vs Negative Reviews",
       y = "Frequency", x = NULL)

This figure reveals that users often rely on emotionally charged or emphatic words when expressing strong opinions. For instance, positive reviews feature frequent praise words like “love,” while negative reviews highlight dissatisfaction with words such as “bad”.

The side-by-side bar chart format enables a direct comparison, making it easy to identify the dominant lexical patterns in each sentiment category. This confirms that vocabulary usage in reviews is strongly shaped by emotional tone.

Analysis and Figure 2

This figure uses the Bing sentiment lexicon to classify words as positive or negative and counts their occurrences in positive and negative reviews. The purpose is to check whether emotional language aligns with the review’s overall sentiment.

The Bing lexicon assigns each listed word a binary label, positive or negative. We joined it to all tokenized words and aggregated the matches, which lets us visually assess whether a review’s word choices align with its sentiment label.
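As a small illustration of the lexicon lookup, the sketch below joins a handful of words against the Bing lexicon. `toy_words` is a hypothetical token list chosen for the example, not output from the analysis.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical tokens standing in for rows of the tokenized reviews
toy_words <- tibble(word = c("love", "plot", "boring", "bad", "camera"))

# inner_join() keeps only the words that have an entry in the lexicon
toy_tagged <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word")

print(toy_tagged)
```

Words without a lexicon entry drop out of the join, so neutral vocabulary never reaches the counts.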

bing <- get_sentiments("bing")

tidy_imdb_sentiment <- tidy_imdb %>%
  rename(review_sentiment = sentiment)

sentiment_words <- tidy_imdb_sentiment %>%
  inner_join(bing, by = "word", relationship = "many-to-many") %>%
  count(review_sentiment, sentiment) %>%
  rename(word_sentiment = sentiment)

sentiment_words %>%
  ggplot(aes(x = "", y = n, fill = word_sentiment)) +
  geom_col(position = "fill", width = 1) +  # normalize so each facet is a full circle
  coord_polar("y") +
  facet_wrap(~ review_sentiment) +
  labs(
    title = "Pie Chart of Emotional Word Proportion",
    fill = "Word Sentiment"
  ) +
  theme_void()

This chart illustrates the consistency between a review’s overall sentiment and the polarity of its individual words. In negative reviews, the large majority of lexicon-matched words fall into the Bing lexicon’s “negative” category, while positive reviews show a more balanced mix of positive and negative words.

This consistency indicates that reviewers often express negative views very directly, which supports the reliability of lexicon-based negative sentiment analysis in this domain.
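Because pie slices are hard to compare by eye, the same comparison can also be expressed as within-group shares. The sketch below assumes a table shaped like `sentiment_words`; the counts in `toy_counts` are hypothetical placeholders, not results from this analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical counts with the same columns as sentiment_words
# (made-up numbers for illustration only)
toy_counts <- tribble(
  ~review_sentiment, ~word_sentiment, ~n,
  "negative",        "negative",      60,
  "negative",        "positive",      40,
  "positive",        "negative",      45,
  "positive",        "positive",      55
)

# Share of each word polarity within each review class
toy_shares <- toy_counts %>%
  group_by(review_sentiment) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(toy_shares)
```

The shares sum to 1 within each review class, which makes the two classes directly comparable even when their total word counts differ.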

Analysis and Figure 3

The third visualization is a word cloud of the most common words in positive reviews. Word clouds are an engaging way to visualize word frequency, especially when highlighting dominant words in qualitative data.

The word cloud provides an intuitive snapshot of the most frequently used words in positive reviews. Words such as “fun,” “love,” and “enjoy” stand out prominently, signaling the emotional emphasis reviewers place on positive experiences.

Although word clouds do not reflect exact frequencies, their visual nature makes them effective for quickly identifying dominant terms and overall emotional tone. The use of the “Dark2” color palette adds contrast and clarity, highlighting keyword diversity.

library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

positive_words <- tidy_imdb %>%
  filter(sentiment == "positive") %>%
  count(word, sort = TRUE)

wordcloud(words = positive_words$word,
          freq = positive_words$n,
          max.words = 100,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))

Word Cloud of Positive Reviews

The dominance of words like “fun”, “love”, and “pretty” in the word cloud confirms that positive reviews frequently include overt praise, showing a rich emotional vocabulary. While not precise, word clouds help surface the emotional weight of reviews at a glance.
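Since a word cloud does not show exact frequencies, a short ranked table can sit alongside it. The sketch below uses a hypothetical token vector (`toy_positive`); with the real data, the same two steps would be applied to the tokens from positive reviews.

```r
library(dplyr)
library(tibble)

# Hypothetical tokens from positive reviews (illustration only)
toy_positive <- tibble(
  word = c("love", "fun", "love", "enjoy", "fun", "love")
)

# Exact counts, most frequent first, trimmed to the top two terms
top_terms <- toy_positive %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 2)

print(top_terms)
```

Reporting the counts next to the cloud keeps the visual appeal while letting readers check the actual magnitudes.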

Conclusion

This analysis demonstrates that sentiment in movie reviews is expressed through clearly distinguishable language patterns. From word choice to emotional polarity, user-generated content on platforms like IMDb reflects strong linguistic signals that correspond to overall sentiment.

These findings have broad applicability. The same techniques can be adapted to analyze customer feedback, social media discourse, or even political messaging — offering a scalable way to monitor public opinion in real time. As natural language processing continues to evolve, such approaches could power next-generation sentiment-aware systems for content recommendation, trend detection, and opinion mining.

Analysis of Emotional and Vocabulary Differences in IMDB Movie Reviews

WANG QIANQIAN

2025-06-17