This assignment reproduces the sentiment analysis example from Chapter 2 of “Text Mining with R” by Julia Silge and David Robinson (2017) and extends it in two ways: (1) applying the same pipeline to a different corpus, and (2) adding further sentiment lexicons (AFINN and NRC).
Citation: Silge, J. & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/sentiment.html
# 1. BASE EXAMPLE (Jane Austen books from the janeaustenr package)
# Packages used throughout this document
library(janeaustenr) # austen_books()
library(dplyr)       # group_by, mutate, joins, count, summarize
library(stringr)     # str_detect / regex for chapter detection
library(tidytext)    # unnest_tokens, stop_words, get_sentiments
library(tidyr)       # pivot_wider
library(ggplot2)     # plots
library(readr)       # read_file for the new corpus
library(knitr)       # kable tables
# Note: the "afinn" and "nrc" lexicons used below also require the textdata
# package to be installed.
# Load and annotate Austen text with line numbers and chapters
austenBooks <- austen_books() %>%
  group_by(book) %>%
  mutate(
    lineNumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
  ) %>%
  ungroup()
# Use tidytext stop words
data("stop_words")
# Tokenize (one word per row), remove stop words
tidyAusten <- austenBooks %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
# Get the "bing" sentiment lexicon (positive / negative labels)
bingLex <- get_sentiments("bing")
# Join tokens with sentiment, then summarize sentiment over the narrative
austenSentiment <- tidyAusten %>%
  inner_join(bingLex, by = "word") %>%
  count(book, index = lineNumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentimentScore = positive - negative)
## Warning in inner_join(., bingLex, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 131015 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
# Plot sentiment trajectory by book
ggplot(austenSentiment, aes(index, sentimentScore, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, scales = "free_x") +
  labs(
    title = "Sentiment Over Time (Jane Austen Books)",
    subtitle = "Positive - Negative word counts using the Bing lexicon",
    x = "Narrative Index (~80 lines)",
    y = "Sentiment Score"
  )
# 2. NEW CORPUS
# Read the new corpus (my_corpus.txt) as a single document
myCorpus <- tibble(
  docId = 1,
  text = read_file("my_corpus.txt")
)
# Tokenize the new corpus and remove stop words
myCorpusTidy <- myCorpus %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
# Peek at the first few tokens of the new corpus
head(myCorpusTidy)
## # A tibble: 6 × 2
##   docId word     
##   <dbl> <chr>    
## 1     1 fantastic
## 2     1 day      
## 3     1 sun      
## 4     1 shining  
## 5     1 positive 
## 6     1 energy
# 3. SENTIMENT ANALYSIS ON THE NEW CORPUS
#
# We'll use two lexicons:
# - "bing": positive / negative
# - "afinn": numeric score (negative to positive)
bingLex <- get_sentiments("bing")
afinnLex <- get_sentiments("afinn")
# Bing summary: how many positive vs negative words?
bingSummary <- myCorpusTidy %>%
  inner_join(bingLex, by = "word") %>%
  count(sentiment) %>%
  mutate(percent = n / sum(n) * 100)
# AFINN summary: numeric sentiment score stats
afinnSummary <- myCorpusTidy %>%
  inner_join(afinnLex, by = "word") %>%
  summarize(
    meanScore = mean(value),
    totalScore = sum(value),
    wordCountMatched = n()
  )
# Show both summaries as nice tables
kable(bingSummary, caption = "Positive vs Negative (Bing Lexicon)")
| sentiment | n | percent | 
|---|---|---|
| negative | 5 | 50 | 
| positive | 5 | 50 | 
kable(afinnSummary, caption = "AFINN Sentiment Summary")
| meanScore | totalScore | wordCountMatched | 
|---|---|---|
| 0.4545455 | 5 | 11 | 
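# Optional diagnostic (a sketch beyond the assignment requirements): list the
# individual corpus words behind the bing percentages above, so the summary
# can be traced back to specific words. Only objects created earlier are used.
bingWordCounts <- myCorpusTidy %>%
  inner_join(bingLex, by = "word") %>%
  count(word, sentiment, sort = TRUE)
kable(bingWordCounts, caption = "Word-Level Matches (Bing Lexicon)")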
# 4. SENTIMENT TREND OVER THE DOCUMENT
#
# We'll approximate "time" by token position and then look at sentiment score
# in bins of 50 tokens.
timeline <- myCorpusTidy %>%
  mutate(position = row_number()) %>%
  inner_join(afinnLex, by = "word") %>%
  group_by(bin = position %/% 50) %>%
  summarize(avgSentiment = mean(value), .groups = "drop")
ggplot(timeline, aes(x = bin, y = avgSentiment)) +
  geom_line(linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Sentiment Trend (AFINN)",
    x = "Token Bin (50-word groups)",
    y = "Average Sentiment Score"
  )
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
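# The warning above means that, for a corpus this short, the 50-token bins
# collapse to a single observation, so geom_line() has nothing to connect.
# One optional workaround (not part of the assignment; the 10-token bin width
# is an arbitrary choice for illustration) is to use smaller bins and draw
# points alongside the line.
timelineFine <- myCorpusTidy %>%
  mutate(position = row_number()) %>%
  inner_join(afinnLex, by = "word") %>%
  group_by(bin = position %/% 10) %>%
  summarize(avgSentiment = mean(value), .groups = "drop")
ggplot(timelineFine, aes(x = bin, y = avgSentiment)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Sentiment Trend (AFINN, 10-token bins)",
    x = "Token Bin (10-word groups)",
    y = "Average Sentiment Score"
  )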
# 5. EXTRA LEXICON: NRC EMOTION CATEGORIES
#
# The NRC lexicon labels words with emotions like joy, fear, anger, trust, etc.
# This goes beyond positive/negative and gives you richer interpretation.
nrcLex <- get_sentiments("nrc")
nrcSummary <- myCorpusTidy %>%
  inner_join(nrcLex, by = "word") %>%
  count(sentiment, sort = TRUE)
kable(nrcSummary, caption = "Emotion Distribution (NRC Lexicon)")
| sentiment | n | 
|---|---|
| anticipation | 8 | 
| joy | 7 | 
| positive | 7 | 
| surprise | 5 | 
| anger | 3 | 
| disgust | 3 | 
| negative | 3 | 
| sadness | 3 | 
| trust | 3 | 
| fear | 2 | 
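# Optional visualization (a sketch, not required by the assignment): show the
# NRC emotion counts from the table above as a horizontal bar chart so the
# categories are easier to compare at a glance. It reuses nrcSummary only.
ggplot(nrcSummary, aes(x = reorder(sentiment, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Emotion Distribution (NRC Lexicon)",
    x = "NRC Category",
    y = "Word Count"
  )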
The Jane Austen example above follows the workflow in Chapter 2 of Silge & Robinson’s (2017) ‘Text Mining with R’: I tokenized the text, removed stop words, joined each token to the Bing sentiment lexicon, and plotted sentiment over the narrative.
I then applied the same pipeline to my own corpus (my_corpus.txt). This satisfies the requirement to work with a different corpus.
I added two sentiment lexicons beyond the basic positive/negative view: AFINN (numeric polarity scores) and NRC (emotion categories like joy, trust, anger). This satisfies the requirement to incorporate at least one additional sentiment lexicon beyond the one in the book.