This assignment reproduces the sentiment analysis example from Chapter 2 of “Text Mining with R” by Julia Silge and David Robinson (2017) and extends it in two ways: (1) it applies the workflow to a different corpus, and (2) it adds another sentiment lexicon.
Citation: Silge, J. & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/sentiment.html
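# Packages used throughout (the "afinn" and "nrc" lexicons are fetched through
# the textdata package the first time get_sentiments() requests them)
library(janeaustenr)
library(dplyr)
library(tidyr)
library(stringr)
library(readr)
library(ggplot2)
library(tidytext)
library(knitr)
# 1. BASE EXAMPLE (Jane Austen books from the janeaustenr package)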
# Load and annotate Austen text with line numbers and chapters
austenBooks <- austen_books() %>%
  group_by(book) %>%
  mutate(
    lineNumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  ) %>%
  ungroup()
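The chapter counter relies on a running sum over a logical vector: each line that looks like a chapter heading adds one, so every later line inherits the current chapter number. A minimal sketch on a hypothetical six-line vector (`toyLines` is an illustration, not part of the Austen data):

```r
library(stringr)

# Hypothetical lines standing in for the start of a book
toyLines <- c("PRIDE AND PREJUDICE", "", "Chapter 1",
              "It is a truth universally acknowledged", "Chapter 2",
              "Mr. Bennet was among the earliest")

# TRUE only where a line starts with "chapter" followed by a digit or roman numeral
isHeading <- str_detect(toyLines, regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Running total of headings seen so far; front matter stays at chapter 0
chapter <- cumsum(isHeading)
chapter  # 0 0 1 1 2 2
```

Lines before the first heading keep chapter 0, which is why title pages and prefaces do not get assigned to Chapter 1.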
# Use tidytext stop words
data("stop_words")
# Tokenize (one word per row), remove stop words
tidyAusten <- austenBooks %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
# Get the "bing" sentiment lexicon (positive / negative labels)
bingLex <- get_sentiments("bing")
# Join tokens with sentiment, then summarize sentiment over the narrative
austenSentiment <- tidyAusten %>%
  # Some rows match more than once on both sides of this join, so declare the
  # many-to-many relationship explicitly; otherwise dplyr warns about it.
  inner_join(bingLex, by = "word", relationship = "many-to-many") %>%
  count(book, index = lineNumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentimentScore = positive - negative)
# Plot sentiment trajectory by book
ggplot(austenSentiment, aes(index, sentimentScore, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, scales = "free_x") +
  labs(
    title = "Sentiment Over Time (Jane Austen Books)",
    subtitle = "Positive - Negative word counts using the Bing lexicon",
    x = "Narrative Index (~80 lines)",
    y = "Sentiment Score"
  )
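The binning and net-score step in the Austen pipeline can be traced by hand on a tiny example. The sketch below assumes a hypothetical six-line text and a hypothetical five-word lexicon (`toyTokens` and `toyLex` are made up for illustration) and uses a chunk size of 3 instead of 80 so the arithmetic stays visible:

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens: one word per row, tagged with its source line
toyTokens <- tibble(
  lineNumber = c(1, 2, 3, 4, 5, 6),
  word = c("happy", "sad", "good", "bad", "awful", "table")
)

# Hypothetical bing-style lexicon with positive / negative labels
toyLex <- tibble(
  word = c("happy", "good", "sad", "bad", "awful"),
  sentiment = c("positive", "positive", "negative", "negative", "negative")
)

toyScores <- toyTokens %>%
  inner_join(toyLex, by = "word") %>%                 # "table" has no label and drops out
  count(index = lineNumber %/% 3, sentiment) %>%      # lines 1-2 -> chunk 0, lines 3-5 -> chunk 1
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentimentScore = positive - negative)        # net sentiment per chunk

toyScores$sentimentScore  # 0 -1
```

Chunk 0 holds one positive and one negative word (net 0), and chunk 1 holds one positive and two negative words (net -1). Line 6 contributes nothing because its only word is missing from the lexicon.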
# 2. NEW CORPUS
# Read your own text data as one big document
myCorpus <- tibble(
  docId = 1,
  text = read_file("my_corpus.txt")
)
# Tokenize your corpus and remove stop words
myCorpusTidy <- myCorpus %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
# Peek at first few tokens from your corpus
head(myCorpusTidy)
## # A tibble: 6 × 2
##   docId word
##   <dbl> <chr>
## 1     1 fantastic
## 2     1 day
## 3     1 sun
## 4     1 shining
## 5     1 positive
## 6     1 energy
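To see what tokenization and stop-word removal do to raw text, the same two steps can be run on a single sentence. This is a small sketch on a hypothetical sentence (`toyDoc` is made up and is not taken from my_corpus.txt):

```r
library(dplyr)
library(tidytext)

data("stop_words")

# Hypothetical one-sentence document
toyDoc <- tibble(docId = 1, text = "The day was fantastic, and the sun was shining!")

toyTidy <- toyDoc %>%
  unnest_tokens(word, text) %>%        # lowercases, strips punctuation, one word per row
  anti_join(stop_words, by = "word")   # drops rows whose word appears in the stop list

toyTidy$word
```

Function words such as "the", "was", and "and" are removed, while content words such as "fantastic" and "shining" remain, which matches the kind of tokens shown in the output above.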
# 3. SENTIMENT ANALYSIS ON YOUR CORPUS
#
# We'll use two lexicons:
# - "bing": positive / negative
# - "afinn": integer score from -5 (most negative) to +5 (most positive)
bingLex <- get_sentiments("bing")
afinnLex <- get_sentiments("afinn")
# Bing summary: how many positive vs negative words?
bingSummary <- myCorpusTidy %>%
  inner_join(bingLex, by = "word") %>%
  count(sentiment) %>%
  mutate(percent = n / sum(n) * 100)
# AFINN summary: numeric sentiment score stats
afinnSummary <- myCorpusTidy %>%
  inner_join(afinnLex, by = "word") %>%
  summarize(
    meanScore = mean(value),
    totalScore = sum(value),
    wordCountMatched = n()
  )
# Show both summaries as nice tables
kable(bingSummary, caption = "Positive vs Negative (Bing Lexicon)")
| sentiment | n | percent |
|---|---|---|
| negative | 5 | 50 |
| positive | 5 | 50 |
kable(afinnSummary, caption = "AFINN Sentiment Summary")
| meanScore | totalScore | wordCountMatched |
|---|---|---|
| 0.4545455 | 5 | 11 |
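The AFINN row is internally consistent: meanScore is totalScore divided by wordCountMatched, here 5 / 11 ≈ 0.4545. The mean only covers words the lexicon recognizes, so it can help to report how much of the corpus was scored at all. A minimal sketch, assuming hypothetical tokens and a hypothetical AFINN-style lexicon (`toyTokens`, `toyAfinn`, and the `coverage` column are illustrations, not part of the analysis above):

```r
library(dplyr)

# Hypothetical tokens and a hypothetical AFINN-style lexicon
toyTokens <- tibble(word = c("fantastic", "day", "terrible", "sun", "happy"))
toyAfinn  <- tibble(word = c("fantastic", "terrible", "happy"), value = c(4, -3, 3))

toySummary <- toyTokens %>%
  inner_join(toyAfinn, by = "word") %>%   # keeps only the 3 words with a score
  summarize(
    meanScore = mean(value),              # (4 - 3 + 3) / 3
    totalScore = sum(value),              # 4 - 3 + 3 = 4
    wordCountMatched = n()
  ) %>%
  # Share of all tokens that the lexicon could score
  mutate(coverage = wordCountMatched / nrow(toyTokens))

toySummary
```

In this toy case 3 of 5 tokens are scored, so the mean of about 1.33 describes 60% of the text rather than all of it.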
# 4. SENTIMENT TREND OVER THE DOCUMENT
#
# We'll approximate "time" by token position (counted after stop-word removal)
# and average the AFINN score within bins of 50 tokens.
timeline <- myCorpusTidy %>%
  mutate(position = row_number()) %>%
  inner_join(afinnLex, by = "word") %>%
  group_by(bin = position %/% 50) %>%
  summarize(avgSentiment = mean(value), .groups = "drop")
ggplot(timeline, aes(x = bin, y = avgSentiment)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) + # keeps the trend visible when there is only one bin
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Sentiment Trend (AFINN)",
    x = "Token Bin (50-word groups)",
    y = "Average Sentiment Score"
  )
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
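The `geom_line()` message suggests that every AFINN match fell into a single 50-token bin, which leaves one point and no line to draw. A rough sketch of how bin width controls the number of bins, using a hypothetical 30-token document (the actual token count of my_corpus.txt is not shown here, so the numbers are illustrative only):

```r
# Hypothetical token positions for a 30-token document
positions <- 1:30

# Integer division by the bin width assigns each position to a bin
binsWide   <- unique(positions %/% 50)  # every position lands in bin 0
binsNarrow <- unique(positions %/% 10)  # positions spread over bins 0, 1, 2, 3

length(binsWide)    # 1
length(binsNarrow)  # 4
```

For a short corpus, a narrower bin (or a longer text) gives the trend line at least two points to connect, at the cost of noisier averages per bin.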
# 5. EXTRA LEXICON: NRC EMOTION CATEGORIES
#
# The NRC lexicon labels words with emotions like joy, fear, anger, trust, etc.
# It also carries its own positive / negative labels, which is why those two
# rows appear alongside the emotions in the summary below.
nrcLex <- get_sentiments("nrc")
nrcSummary <- myCorpusTidy %>%
  inner_join(nrcLex, by = "word") %>%
  count(sentiment, sort = TRUE)
kable(nrcSummary, caption = "Emotion Distribution (NRC Lexicon)")
| sentiment | n |
|---|---|
| anticipation | 8 |
| joy | 7 |
| positive | 7 |
| surprise | 5 |
| anger | 3 |
| disgust | 3 |
| negative | 3 |
| sadness | 3 |
| trust | 3 |
| fear | 2 |
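The NRC counts above sum to 44 label matches. Because one word can carry several NRC labels, this total counts word-label pairs rather than words. A minimal sketch with a hypothetical three-word corpus and a hypothetical NRC-style lexicon (`toyWords` and `toyNrc` are made up for illustration):

```r
library(dplyr)

# Hypothetical tokens
toyWords <- tibble(word = c("sunshine", "storm", "gift"))

# Hypothetical NRC-style lexicon: each word appears once per label it carries
toyNrc <- tibble(
  word = c("sunshine", "sunshine", "storm", "storm", "gift", "gift", "gift"),
  sentiment = c("joy", "positive", "fear", "negative", "joy", "surprise", "positive")
)

toyNrcSummary <- toyWords %>%
  inner_join(toyNrc, by = "word") %>%   # one output row per word-label pair
  count(sentiment, sort = TRUE)

nrow(toyWords)        # 3 tokens
sum(toyNrcSummary$n)  # 7 label matches
```

Three tokens produce seven label matches here, so the counts are best read as how often each emotion is evoked, not as a partition of the words.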
The Jane Austen example above follows the workflow in Chapter 2 of “Text Mining with R” (Silge & Robinson, 2017). I tokenized the text, removed stop words, joined each token to the bing sentiment lexicon, and plotted net sentiment (positive minus negative word counts) over the narrative in 80-line chunks.
I then applied the same pipeline to my own corpus (my_corpus.txt). This satisfies the requirement to work with a different corpus.
I added two sentiment lexicons beyond the positive/negative bing view used in the base example: AFINN (numeric polarity scores) and NRC (emotion categories such as joy, trust, and anger). This satisfies the requirement to incorporate at least one additional sentiment lexicon.