[AUTOMATED TEXT ANALYSIS] Final report_20201097 이민경

Executive summary

This project investigates how words with positive or negative emotional polarity are used in IMDB movie reviews. Using the Bing sentiment lexicon, I analyze which sentiment words tend to appear in reviews with each sentiment label and which are most distinctive of each label. The analysis contributes to a better understanding of how positive and negative sentiment are conveyed linguistically in movie reviews.

Data background

I use the IMDB 50K dataset, a widely used corpus for sentiment classification. It contains 50,000 movie reviews labeled as either “positive” or “negative”. Each review is a free-text description of a viewer’s opinion about a movie.

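library(tidyverse)  # readr, dplyr, stringr, ggplot2
library(tidytext)   # unnest_tokens(), get_sentiments(), bind_tf_idf(), reorder_within()
reviews <- read_csv("IMDB Dataset.csv") %>%
  mutate(review = str_remove_all(review, "<br />"))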
## Rows: 50000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): review, sentiment
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(reviews)
## Rows: 50,000
## Columns: 2
## $ review    <chr> "One of the other reviewers has mentioned that after watchin…
## $ sentiment <chr> "positive", "positive", "positive", "negative", "positive", …
reviews %>%
  count(sentiment) %>%
  knitr::kable()
sentiment       n
negative    25000
positive    25000

Data loading, cleaning and preprocessing

In this section, I tokenize the review text into individual words, remove stopwords, and filter the remaining tokens by joining them with the Bing sentiment lexicon.

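# Stopword list shipped with tidytext
data("stop_words")
# Bing lexicon; rename its column so it does not clash with the review label
bing_lex <- get_sentiments("bing") %>%
  rename(word_sentiment = sentiment)

# One row per token; keep only non-stopword tokens that appear in the lexicon
tidy_reviews <- reviews %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word") %>%
  inner_join(bing_lex, by = "word")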
## Warning in inner_join(., bing_lex, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1213804 of `x` matches multiple rows in `y`.
## ℹ Row 5781 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
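
The many-to-many warning above is raised when a word appears more than once on the lexicon side of the join, so that every matching token row is duplicated. A minimal sketch of how this could be checked, using a small hypothetical lexicon (`toy_lex`) and token table (`toy_tokens`) that are invented for illustration; with the real objects, the same `count()` and `filter()` check could be run on `bing_lex`:

```r
library(dplyr)
library(tibble)

# Hypothetical lexicon in which one word is listed with both polarities
toy_lex <- tribble(
  ~word,   ~word_sentiment,
  "good",  "positive",
  "bad",   "negative",
  "funny", "positive",
  "funny", "negative"
)

# Hypothetical tokens from two reviews
toy_tokens <- tribble(
  ~id, ~word,
  1,   "good",
  1,   "funny",
  2,   "funny",
  2,   "bad"
)

# Words that occur more than once in the lexicon
dup_words <- toy_lex %>%
  count(word) %>%
  filter(n > 1)

# Each duplicated lexicon entry multiplies the matching token rows;
# declaring the relationship makes the duplication explicit
joined <- toy_tokens %>%
  inner_join(toy_lex, by = "word", relationship = "many-to-many")

nrow(toy_tokens)  # 4 token rows before the join
nrow(joined)      # 6 rows after the join: each "funny" token is counted twice
```

Tokens matched by a duplicated entry are counted once as positive and once as negative, which is worth keeping in mind when reading the word counts that follow.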

Text data analysis

In this section, I conduct a sentiment analysis of IMDB movie reviews using the Bing lexicon and explore how positive and negative sentiment words are expressed in text.

Individual analysis and figures

The following three visualizations present the key findings from my analysis, each focusing on a different aspect of sentiment word usage.

Analysis and Figure 1

I count the number of positive and negative words that appear in reviews labeled as “positive” or “negative”. This comparison allows me to see whether review sentiment correlates with word-level sentiment from the lexicon.

I chose a grouped bar chart (using geom_col(position = "dodge")) because it clearly shows the contrast between the frequency of positive and negative words across different sentiment categories. The bars offer an immediate visual comparison, and the use of color enhances readability. This design helps convey the overall balance of sentiment expression in a concise and accessible format.

# Count lexicon words by review label and lexicon polarity
word_counts <- tidy_reviews %>%
  count(sentiment, word_sentiment)

plot1 <- ggplot(word_counts, aes(x = word_sentiment, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Positive vs Negative Words Used in Reviews",
       x = "Lexicon Sentiment", y = "Word Count") +
  theme_minimal()

# Assumes the images/ directory already exists
ggsave("images/bing_sentiment_distribution.png", plot1, width = 7, height = 5)
plot1
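
Raw counts depend on how much text each review class contains, so it can also help to look at the share of positive and negative lexicon words within each class. A minimal sketch, assuming a hypothetical table `toy_counts` with the same columns as `word_counts`; the numbers are made up for illustration and are not results from the IMDB data:

```r
library(dplyr)
library(tibble)

# Hypothetical counts with the same layout as word_counts
toy_counts <- tribble(
  ~sentiment, ~word_sentiment, ~n,
  "negative", "negative",      60,
  "negative", "positive",      40,
  "positive", "negative",      30,
  "positive", "positive",      70
)

# Share of each lexicon polarity within each review label
toy_shares <- toy_counts %>%
  group_by(sentiment) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

toy_shares
```

Plotting `share` instead of `n` would put both review labels on the same 0 to 1 scale.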

Analysis and Figure 2

Here, I identify the most common sentiment words in each sentiment category, taking the seven most frequent words for each combination of review label and lexicon polarity. I first considered a faceted bar chart that separates words by polarity. However, because the same word can appear in both positive and negative reviews, stacked bars for a single word are hard to read. I therefore use a faceted dot plot, which shows how each word’s frequency splits between the two review labels.

I chose the dot plot because it reduces visual clutter compared to stacked or side-by-side bar charts, especially when the same word appears in both sentiment categories. Dots allow readers to clearly compare the relative frequency of each word by sentiment, while facets help organize the chart by word polarity. This design emphasizes clarity and readability when displaying overlapping linguistic features.

# Top 7 words for each review label and lexicon polarity (ties are kept)
top_bing_words <- tidy_reviews %>%
  count(sentiment, word, word_sentiment, sort = TRUE) %>%
  group_by(sentiment, word_sentiment) %>%
  slice_max(n, n = 7) %>%
  ungroup()

plot2 <- ggplot(top_bing_words, aes(x = n, y = reorder_within(word, n, word_sentiment), color = sentiment)) +
  geom_point(size = 3) +
  facet_wrap(~word_sentiment, scales = "free") +
  scale_y_reordered() +
  scale_x_continuous(labels = scales::comma) +
  labs(title = "Top Words by Bing Sentiment Type",
       x = "Frequency", y = "Word") +
  theme_minimal()

ggsave("images/bing_top_words_by_sentiment_type.png", plot2, width = 8, height = 6)
plot2
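
The per-word split that the dot plot displays can also be tabulated directly. A minimal sketch, assuming a hypothetical table `toy_word_counts` with invented words and counts (not values from the analysis), that spreads the two review labels into columns and computes the share of each word's occurrences that fall in negative reviews:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical per-label word counts; "awful" occurs under one label only
toy_word_counts <- tribble(
  ~sentiment, ~word,   ~n,
  "negative", "bad",   90,
  "positive", "bad",   10,
  "negative", "love",  20,
  "positive", "love",  80,
  "negative", "awful", 5
)

# One row per word, one column per review label; missing combinations become 0
toy_split <- toy_word_counts %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(share_negative = negative / (negative + positive))

toy_split
```

A share close to 1 or 0 marks a word used almost exclusively under one label, while values near 0.5 mark words that both groups use.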

Analysis and Figure 3

This section uses Term Frequency-Inverse Document Frequency (TF-IDF) to identify words that are not only frequent but also distinctive of positive or negative reviews. This helps reveal the vocabulary that most strongly characterizes each sentiment class.

I chose TF-IDF instead of a word cloud because a word cloud is based purely on frequency and cannot show which words set positive and negative reviews apart. TF-IDF down-weights words that both classes share and so highlights the vocabulary specific to each one.
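
For reference, the weighting can be written out as follows. The notation is my own: $n_{w,d}$ is the count of word $w$ in document $d$, and $N$ is the number of documents. To my understanding this matches what tidytext's `bind_tf_idf()` computes, including the natural logarithm. In this analysis each sentiment label acts as one document, so $N = 2$, and counts cover only the lexicon words kept in `tidy_reviews`.

```latex
\[
\mathrm{tf}(w, d) = \frac{n_{w,d}}{\sum_{w'} n_{w',d}}, \qquad
\mathrm{idf}(w) = \ln \frac{N}{\lvert \{ d : n_{w,d} > 0 \} \rvert}, \qquad
\text{tf-idf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
\]
```

With $N = 2$, $\mathrm{idf}(w) = \ln 2 \approx 0.693$ for a word found under only one label and $\ln 1 = 0$ for a word found under both.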

I designed the visualization as a faceted horizontal bar chart using geom_col(), with words on the vertical axis and TF-IDF scores on the horizontal axis after coord_flip(). The flipped coordinates keep long word labels readable. Each sentiment category has its own fill color, and because the facets already label the two classes, the legend is hidden to avoid clutter. The minimalist theme keeps the viewer’s attention on the data rather than decorative elements. Together, these choices present the top differentiating words in each sentiment group clearly and without distortion.

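# Treat each review label as one document (two documents in total) and
# compute TF-IDF over the lexicon words retained in tidy_reviews
tfidf_words <- tidy_reviews %>%
  count(sentiment, word) %>%
  bind_tf_idf(word, sentiment, n) %>%
  group_by(sentiment) %>%
  slice_max(tf_idf, n = 10) %>%  # top 10 per label; ties are kept
  ungroup()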

plot3 <- ggplot(tfidf_words, aes(x = reorder_within(word, tf_idf, sentiment), y = tf_idf, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.001)) +
  labs(title = "Top TF-IDF Words by Review Sentiment",
       x = "Word", y = "TF-IDF") +
  theme_minimal()

ggsave("images/bing_tfidf_by_sentiment.png", plot3, width = 7, height = 5)
plot3
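
To see what grouping by sentiment label implies in practice, a small sketch can run a hypothetical count table through `bind_tf_idf()` in the same way. The table `toy_counts`, its words, and its numbers are invented for illustration and are not taken from the IMDB data:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical counts: "bad" occurs under both labels,
# "dreary" and "radiant" under one label each
toy_counts <- tribble(
  ~sentiment, ~word,     ~n,
  "negative", "bad",     50,
  "negative", "dreary",   1,
  "positive", "bad",     10,
  "positive", "radiant",  1
)

# Same call pattern as in the analysis: term = word, document = sentiment
toy_tfidf <- toy_counts %>%
  bind_tf_idf(word, sentiment, n) %>%
  arrange(desc(tf_idf))

toy_tfidf
```

In this toy table the frequent word "bad" scores 0 under both labels, while each rare one-label word keeps a positive score, so the ranking favors exclusivity over frequency.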

Conclusion

Through a multi-step sentiment analysis using the Bing lexicon, I uncovered clear differences in the use of sentiment words across IMDB movie reviews labeled as positive and negative.

In the first stage, I found that negative reviews contain more negative words than positive reviews do, while positive reviews are rich in positive expressions. However, the overlap, such as the presence of negative words in positive reviews, suggests that reviewers often express mixed or nuanced opinions even within clearly labeled categories.

The top sentiment words analysis revealed that certain words, like bad, worst, and boring, are highly concentrated in negative reviews, while love, excellent, and beautiful dominate positive reviews. I used a faceted dot plot instead of a bar chart to better illustrate overlapping word usage across sentiment groups, enhancing visual clarity.

The TF-IDF analysis proved especially valuable in surfacing distinctive terms based not merely on frequency but on how specific they are to one sentiment group. Words like uncreative, wretchedly, and harrow stood out in negative reviews, while openness, jubilant, and contentment characterized positive ones. Because the analysis treats the two sentiment labels as its only two documents, a word receives a nonzero score only if it occurs under one label and never under the other. These terms are therefore best read as emotionally charged, context-specific vocabulary exclusive to one class rather than as its most common words.

Overall, this project demonstrates how lexicon-based sentiment analysis and TF-IDF weighting can work in tandem to uncover both broad patterns and subtle differences in textual data. These findings can inform applications in automated review classification, customer opinion mining, and emotional tone detection in digital media.