[AUTOMATED TEXT ANALYSIS] Final report_20201097 이민경
This project investigates how positive and negative sentiment words are
used in IMDB movie reviews. Using the Bing sentiment lexicon, I analyze
which sentiment words tend to appear in reviews with each sentiment
label and which ones most clearly distinguish the two classes. The
analysis contributes to a better understanding of how positive and
negative sentiment are conveyed linguistically in movie reviews.
I use the IMDB 50K dataset, a widely used corpus for sentiment classification. It contains 50,000 movie reviews labeled as either “positive” or “negative”. Each review is a free-text description of a viewer’s opinion about a movie.
reviews <- read_csv("IMDB Dataset.csv") %>%
  mutate(review = str_remove_all(review, "<br />"))
## Rows: 50000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): review, sentiment
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
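One thing worth checking in the loading step is what happens to the words on either side of a removed `<br />` tag. The sketch below uses a made-up review string (not taken from the dataset) to compare removing the tag outright with replacing it by a space. The replacement variant is an assumption about a possible refinement, not what the analysis in this report does.

```r
library(stringr)

# Hypothetical review text; paragraphs are separated by "<br /><br />"
raw <- "Great ending.<br /><br />Terrible pacing."

# Removing the tag outright, as in the loading step, glues the neighbours together
removed <- str_remove_all(raw, "<br />")

# Replacing the tag with a space and collapsing repeated whitespace keeps them apart
spaced <- str_squish(str_replace_all(raw, "<br />", " "))

print(removed)
print(spaced)
```

If glued strings such as `ending.Terrible` survive tokenization as single tokens, they would not match any lexicon entry, so it may be worth comparing token counts under both variants.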
glimpse(reviews)
## Rows: 50,000
## Columns: 2
## $ review <chr> "One of the other reviewers has mentioned that after watchin…
## $ sentiment <chr> "positive", "positive", "positive", "negative", "positive", …
reviews %>%
  count(sentiment) %>%
  kable()
| sentiment | n |
|---|---|
| negative | 25000 |
| positive | 25000 |
In this section, I tokenize the review text into individual words, remove stopwords, and filter the remaining tokens by joining them with the Bing sentiment lexicon.
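data("stop_words")
# Rename the lexicon's polarity column so it does not clash with the review label
bing_lex <- get_sentiments("bing") %>%
  rename(word_sentiment = sentiment)
# One row per token; keep only non-stopword tokens that appear in the Bing lexicon
tidy_reviews <- reviews %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word") %>%
  inner_join(bing_lex, by = "word")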
## Warning in inner_join(., bing_lex, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1213804 of `x` matches multiple rows in `y`.
## ℹ Row 5781 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
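The many-to-many warning above indicates that the lexicon contains more than one row for at least one word (for instance, a word listed under both polarities). The sketch below reproduces that situation with a made-up four-row lexicon and shows one way to find the duplicated words. The names `toy_lex` and `toy_tokens` and their contents are illustrative assumptions, not part of the analysis.

```r
library(dplyr)

# Hypothetical toy lexicon: "envious" is listed once per polarity
toy_lex <- tibble(
  word = c("good", "bad", "envious", "envious"),
  word_sentiment = c("positive", "negative", "positive", "negative")
)

# Hypothetical tokens from two reviews
toy_tokens <- tibble(
  id = c(1, 1, 2),
  word = c("good", "envious", "bad")
)

# Words with more than one lexicon row are what trigger the warning
dup_words <- toy_lex %>%
  count(word, name = "n_rows") %>%
  filter(n_rows > 1)

# Declaring the relationship silences the warning, but each "envious"
# token is still counted twice, once per polarity
joined <- toy_tokens %>%
  inner_join(toy_lex, by = "word", relationship = "many-to-many")

print(dup_words)
print(joined)
```

Running the same `count()` and `filter()` on `bing_lex` would list the affected words. Whether to drop them or keep both rows is a judgment call.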
In this section, I conduct a sentiment analysis of IMDB movie reviews using the Bing lexicon and explore how positive and negative sentiment words are expressed in text.
The following three visualizations present the key findings from my analysis, each focusing on a different aspect of sentiment word usage.
I count the number of positive and negative words that appear in reviews labeled as “positive” or “negative”. This comparison allows me to see whether review sentiment correlates with word-level sentiment from the lexicon.
I chose a grouped bar chart (using
geom_col(position = "dodge")) because it clearly shows the
contrast between the frequency of positive and negative words across
different sentiment categories. The bars offer an immediate visual
comparison, and the use of color enhances readability. This design helps
convey the overall balance of sentiment expression in a concise and
accessible format.
# Count lexicon words by review label and by word polarity
word_counts <- tidy_reviews %>%
  count(sentiment, word_sentiment)
plot1 <- ggplot(word_counts, aes(x = word_sentiment, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = comma) +
  labs(title = "Positive vs Negative Words Used in Reviews",
       x = "Lexicon Sentiment", y = "Word Count") +
  theme_minimal()
ggsave("images/bing_sentiment_distribution.png", plot1, width = 7, height = 5)
plot1
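Raw counts can be hard to compare if one review class simply contains more lexicon words overall. As a complementary view, the sketch below converts counts to within-class shares. The numbers in `toy_counts` are invented for illustration and are not results from this analysis.

```r
library(dplyr)

# Hypothetical counts in the same shape as word_counts (not real results)
toy_counts <- tibble(
  sentiment = c("negative", "negative", "positive", "positive"),
  word_sentiment = c("negative", "positive", "negative", "positive"),
  n = c(600, 400, 300, 700)
)

# Share of each word polarity within each review label
toy_shares <- toy_counts %>%
  group_by(sentiment) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(toy_shares)
```

Applied to `word_counts`, the `share` column could replace `n` on the y-axis of the grouped bar chart.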
Here, I identify the most common sentiment words in each sentiment category. I first considered a faceted bar chart that separates words by sentiment polarity. However, because a single word can appear in both positive and negative reviews, stacked bars can be visually ambiguous. To improve clarity, I use a faceted dot plot instead, which better shows how each word’s frequency splits across review labels.
I chose the dot plot because it reduces visual clutter compared to stacked or side-by-side bar charts, especially when the same word appears in both sentiment categories. Dots allow readers to clearly compare the relative frequency of each word by sentiment, while facets help organize the chart by word polarity. This design emphasizes clarity and readability when displaying overlapping linguistic features.
# Top 7 words for each combination of review label and word polarity
top_bing_words <- tidy_reviews %>%
  count(sentiment, word, word_sentiment, sort = TRUE) %>%
  group_by(sentiment, word_sentiment) %>%
  slice_max(n, n = 7) %>%
  ungroup()
plot2 <- ggplot(top_bing_words, aes(x = n, y = reorder_within(word, n, word_sentiment), color = sentiment)) +
  geom_point(size = 3) +
  facet_wrap(~word_sentiment, scales = "free") +
  scale_y_reordered() +
  scale_x_continuous(labels = comma) +
  labs(title = "Top Words by Bing Sentiment Type",
       x = "Frequency", y = "Word") +
  theme_minimal()
ggsave("images/bing_top_words_by_sentiment_type.png", plot2, width = 8, height = 6)
plot2
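The per-word split that the dot plot displays can also be expressed as a single number per word. The sketch below reshapes per-label counts so that each word has one row, then computes the share of its occurrences that fall in positive reviews. The counts in `toy_word_counts` are invented for illustration and are not results from this analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-label counts for two words (not real results)
toy_word_counts <- tibble(
  sentiment = c("negative", "positive", "negative", "positive"),
  word = c("bad", "bad", "love", "love"),
  n = c(80, 20, 30, 70)
)

# One row per word, with the share of its occurrences found in positive reviews
word_split <- toy_word_counts %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(share_in_positive = positive / (positive + negative))

print(word_split)
```

A share near 0.5 would flag a word that is used about equally in both review classes despite its lexicon polarity.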
This section uses Term Frequency-Inverse Document Frequency (TF-IDF) to determine which words are not only frequent but also distinctively associated with positive or negative reviews. This helps reveal the vocabulary that most strongly characterizes each sentiment class.
I chose TF-IDF instead of a wordcloud because a wordcloud is based purely on frequency, whereas TF-IDF highlights the words that are most distinctive to each sentiment category and so helps uncover the vocabulary that sets positive and negative reviews apart.
I designed the visualization as a faceted horizontal bar chart using
geom_col(), with TF-IDF scores on the horizontal axis. The flipped
coordinates improve readability for long word labels. Each sentiment
category has its own fill color, and because the facet titles already
name the two classes, the legend is hidden to avoid clutter. The
minimalist theme keeps the viewer’s attention on the data rather than on
decorative elements. This design presents the data faithfully by
surfacing the top differentiating words in each sentiment group.
# Treat each review label as one document and rank words by TF-IDF within it
tfidf_words <- tidy_reviews %>%
  count(sentiment, word) %>%
  bind_tf_idf(word, sentiment, n) %>%
  group_by(sentiment) %>%
  slice_max(tf_idf, n = 10) %>%
  ungroup()
plot3 <- ggplot(tfidf_words, aes(x = reorder_within(word, tf_idf, sentiment), y = tf_idf, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(labels = percent_format(accuracy = 0.001)) +
  labs(title = "Top TF-IDF Words by Review Sentiment",
       x = "Word", y = "TF-IDF") +
  theme_minimal()
ggsave("images/bing_tfidf_by_sentiment.png", plot3, width = 7, height = 5)
plot3
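To see how TF-IDF behaves when the two review labels are the only documents, the sketch below runs `bind_tf_idf()` on a made-up table of counts. The words and numbers in `toy_counts` are invented for illustration. With two documents, the IDF term can only take two values: `log(2)` for a word found in one class and `0` for a word found in both.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts: "bad" occurs in both classes, the other words in one each
toy_counts <- tibble(
  sentiment = c("negative", "negative", "positive", "positive"),
  word = c("bad", "wretchedly", "bad", "jubilant"),
  n = c(90, 10, 10, 40)
)

# With two documents, idf is log(2 / 1) for class-exclusive words
# and log(2 / 2) = 0 for words present in both classes
toy_tfidf <- toy_counts %>%
  bind_tf_idf(word, sentiment, n)

print(toy_tfidf)
```

In this toy table, "bad" scores zero in both classes even though it is far more frequent in the negative one, which is why a two-document TF-IDF ranking tends to surface only class-exclusive words.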
Through a multi-step sentiment analysis using the Bing lexicon, I uncovered clear differences in the use of sentiment words across IMDB movie reviews labeled as positive and negative.
In the first stage, I found that negative reviews contain more negative lexicon words than positive reviews do, while positive reviews contain more positive ones. However, the overlap (for example, negative words in positive reviews) suggests that reviewers often express mixed or nuanced opinions, even within clearly labeled categories.
The top sentiment words analysis revealed that certain words, like bad, worst, and boring, are highly concentrated in negative reviews, while love, excellent, and beautiful dominate positive reviews. I used a faceted dot plot instead of a bar chart to better illustrate overlapping word usage across sentiment groups, enhancing visual clarity.
The TF-IDF analysis surfaced terms that are distinctive rather than merely frequent. Words like uncreative, wretchedly, and harrow stood out in negative reviews, while openness, jubilant, and contentment characterized positive ones. One caveat applies: because the two sentiment classes are the only two documents, the IDF term is zero for any word that occurs in both classes, so the top-ranked words are necessarily those found in only one class, and such words tend to be rare. With that in mind, the result suggests that strong sentiment is often expressed through emotionally charged, context-specific vocabulary.
Overall, this project demonstrates how lexicon-based sentiment analysis and TF-IDF weighting can work in tandem to uncover both broad patterns and subtle differences in textual data. These findings can inform applications in automated review classification, customer opinion mining, and emotional tone detection in digital media.