This report replicates the sentiment analysis in Chapter 2 of Silge and Robinson’s (2017) Text Mining with R, then extends the method by analyzing New York Times-style headlines with both the AFINN and Bing sentiment lexicons.
library(tidytext)
library(dplyr)
library(ggplot2)
library(stringr)
library(tidyr)
library(janeaustenr) # for original example
library(textdata) # for additional lexicons
library(readr)
library(tibble)
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    line_number = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  ) %>%
  ungroup()
# Tokenization: one word per row
tidy_books <- original_books %>%
  unnest_tokens(word, text)
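As a small illustration of what `unnest_tokens()` does with its default settings (one word per row, lowercased, punctuation stripped), the sketch below tokenizes two short lines. The names `demo` and `demo_tokens` are hypothetical and not part of the analysis above.

```r
library(tibble)
library(tidytext)

# Two short lines standing in for rows of a book (illustrative input)
demo <- tibble(
  line_number = 1:2,
  text = c("It is a truth universally acknowledged,", "that a single man")
)

# Split the text column into one lowercase word per row
demo_tokens <- unnest_tokens(demo, word, text)

demo_tokens
```

The other columns (here `line_number`) are carried along with each word, which is what later allows words to be grouped back into sections of a book.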
bing <- get_sentiments("bing")
# A few Bing words carry both labels, so the join is declared many-to-many
bing_sentiment <- tidy_books %>%
  inner_join(bing, by = "word", relationship = "many-to-many")
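A join against Bing can be many-to-many because the lexicon appears to list a small number of words under both labels, and each such word also occurs many times in the books. One way to check this, sketched here with the hypothetical names `bing_duplicates` and `n_labels`, is to count how often each word appears in the lexicon:

```r
library(dplyr)
library(tidytext)

bing <- get_sentiments("bing")

# Words listed more than once in the lexicon, i.e. under more than one label
bing_duplicates <- bing %>%
  count(word, name = "n_labels") %>%
  filter(n_labels > 1)

bing_duplicates
```

If this returns any rows, those words contribute once to the positive count and once to the negative count for every occurrence in the text, so they cancel out in a positive-minus-negative score.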
# Count sentiment over time, in 80-line sections
bing_sentiment_count <- bing_sentiment %>%
  count(book, index = line_number %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment_score = positive - negative)
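The `line_number %/% 80` expression uses integer division to assign each line to a section of the book. A minimal sketch of that binning on a few hand-picked line numbers (the variable `section_index` is hypothetical):

```r
# Line numbers chosen to sit on either side of the section boundaries
line_number <- c(1, 79, 80, 159, 160)

# Integer division drops the remainder, so lines 1-79 fall in section 0,
# lines 80-159 in section 1, and so on
section_index <- line_number %/% 80

section_index
# 0 0 1 1 2
```

Because `row_number()` starts at 1, the first section holds 79 lines rather than 80; this has little effect on the plotted trajectory.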
ggplot(bing_sentiment_count, aes(index, sentiment_score, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, scales = "free_x") +
  labs(title = "Sentiment Trajectory of Jane Austen’s Books")
## Extension: working with a different corpus
nyt_headlines <- tribble(
  ~headline,
  "Economy shows strong growth amid optimism",
  "Devastating floods leave thousands homeless",
  "Stock market crashes sparking investor fear",
  "Healthcare reform brings relief to families",
  "Biden promises hopeful future for workers",
  "Rising crime fuels public anger and concern"
)
tidy_nyt <- nyt_headlines %>%
  unnest_tokens(word, headline) %>%
  anti_join(stop_words, by = "word")
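afinn <- get_sentiments("afinn")
# Note: index groups consecutive matched words roughly in pairs across the
# whole corpus; it does not correspond to individual headlines
nyt_afinn <- tidy_nyt %>%
  inner_join(afinn, by = "word") %>%
  mutate(index = row_number() %/% 2) %>%
  group_by(index) %>%
  summarise(sentiment = sum(value))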
print(nyt_afinn)
## # A tibble: 6 × 2
## index sentiment
## <dbl> <dbl>
## 1 0 2
## 2 1 4
## 3 2 -3
## 4 3 -1
## 5 4 -1
## 6 5 -3
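The index above comes from the position of each matched word, so a group can straddle two headlines. To score each headline separately, one option is to attach an identifier before tokenizing and group by it. The sketch below is self-contained and uses a hypothetical `toy_lexicon` whose values are illustrative only and not AFINN's; `headline_id` and `headline_scores` are also names introduced here.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two of the headlines, each tagged with its own identifier
headlines <- tribble(
  ~headline,
  "Economy shows strong growth amid optimism",
  "Devastating floods leave thousands homeless"
) %>%
  mutate(headline_id = row_number())

# Hypothetical stand-in lexicon; the values are illustrative, not AFINN's
toy_lexicon <- tribble(
  ~word,         ~value,
  "strong",       2,
  "growth",       2,
  "devastating", -2,
  "homeless",    -2
)

# Tokenize, keep only scored words, and sum the scores within each headline
headline_scores <- headlines %>%
  unnest_tokens(word, headline) %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(headline_id) %>%
  summarise(sentiment = sum(value), .groups = "drop")

headline_scores
```

Replacing `toy_lexicon` with `get_sentiments("afinn")` should give per-headline AFINN scores. Headlines with no matched words drop out of the result under an inner join.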
ggplot(nyt_afinn, aes(index, sentiment)) +
  geom_col(fill = "purple") +
  labs(title = "AFINN Sentiment in NYT Headlines",
       x = "Word-Pair Index", y = "Sentiment Score")
# Bing was loaded above; count positive and negative words in the headlines
nyt_bing <- tidy_nyt %>%
  inner_join(bing, by = "word") %>%
  count(sentiment)
ggplot(nyt_bing, aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Sentiment in NYT Headlines (Bing)",
       x = "Sentiment Category", y = "Word Count")
This analysis shows how sentiment lexicons such as AFINN and Bing can summarize the emotional tone of news headlines. AFINN assigns each word an integer score from -5 to 5 and so captures intensity, whereas Bing labels each word as simply positive or negative. Future work could draw on real-time data via APIs or use the NRC lexicon, which tags words with a wider range of emotions.
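The difference between a numeric and a binary lexicon can be seen by collapsing scores into labels. This sketch uses hypothetical scored words (`scored_words`, with illustrative values that are not taken from AFINN) and compares the two summaries:

```r
library(dplyr)
library(tibble)

# Hypothetical scored words; values are illustrative, not taken from AFINN
scored_words <- tribble(
  ~word,     ~value,
  "hopeful",  2,
  "relief",   1,
  "fear",    -2,
  "crime",   -3
)

# Collapse each numeric score into a positive/negative label
polarity <- scored_words %>%
  mutate(sentiment = if_else(value > 0, "positive", "negative"))

# Binary view: two positive words and two negative words
count(polarity, sentiment)

# Numeric view: the scores sum to -2, so the negative words weigh more
summarise(polarity, total = sum(value))
```

The binary counts come out balanced while the numeric total is negative, which is the kind of intensity information a positive/negative lexicon cannot express.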
Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media, Inc.
Tidytext package: https://github.com/juliasilge/tidytext