Assignment #10A

Approach

The approach for this assignment is to obtain the primary example code and insert it into the QMD file, then choose a different text corpus, add another sentiment lexicon, and understand what a lexicon is. I don't yet fully understand what is being asked for regarding a sentiment lexicon.
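
As a working definition, a sentiment lexicon can be thought of as a lookup table that maps words to a sentiment label or score. The sketch below uses a made-up four-word lexicon and a made-up sentence (`toy_lexicon` and `toy_tokens` are hypothetical names, not part of any package) to show the basic idea: join the tokens to the lexicon and count what matches.

```r
library(dplyr)

# Hypothetical toy lexicon: each word carries one sentiment label
toy_lexicon <- tibble(
  word      = c("happy", "love", "sad", "fear"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Hypothetical tokenized sentence, one word per row
toy_tokens <- tibble(
  word = c("i", "love", "a", "happy", "ending", "not", "a", "sad", "one")
)

# Keep only the words that appear in the lexicon, with their labels
scored <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word")

# Tally the matched words by sentiment
counts <- scored %>%
  count(sentiment)
```

The real lexicons used below follow the same pattern: Bing labels words as positive or negative, NRC assigns words to categories such as joy, and Syuzhet returns a numeric score per word.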

Code Base

Reproduce the Base Example

library(tidytext)
library(janeaustenr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(stringr)
library(tidyr)
library(ggplot2)
library(gutenbergr)
library(syuzhet)
library(textdata)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Joy words in Emma
nrc_joy <- get_sentiments("nrc") %>% filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

Joining with `by = join_by(word)`

# A tibble: 301 × 2
   word          n
   <chr>     <int>
 1 good        359
 2 friend      166
 3 hope        143
 4 happy       125
 5 love        117
 6 deal         92
 7 found        92
 8 present      89
 9 kind         82
10 happiness    76
# ℹ 291 more rows

# Sentiment across novels
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

Joining with `by = join_by(word)`

Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

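The many-to-many warning above appears when a word occurs more than once in the text and is also listed more than once in the lexicon. The sketch below reproduces that situation with made-up data (`toy_lexicon` and `toy_tokens` are hypothetical) and shows one way to find the duplicated lexicon entries and to declare the relationship explicitly. It assumes dplyr 1.1.0 or later, where `inner_join()` accepts the `relationship` argument.

```r
library(dplyr)

# Hypothetical toy lexicon: "envious" is listed under both sentiments
toy_lexicon <- tibble(
  word      = c("good", "bad", "envious", "envious"),
  sentiment = c("positive", "negative", "positive", "negative")
)

# Hypothetical tokens: "envious" occurs on two different lines
toy_tokens <- tibble(
  linenumber = c(1, 1, 2, 3),
  word       = c("good", "envious", "envious", "bad")
)

# Lexicon words that appear more than once
dup_words <- toy_lexicon %>%
  count(word) %>%
  filter(n > 1)

# Declaring the relationship returns the same rows without the warning
joined <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")
```

Each duplicated word then counts once as positive and once as negative per occurrence, so it cancels out in the `positive - negative` score.
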
Extend the Analysis: Different Corpus (Lincoln Speech) and Syuzhet Sentiment Lexicon

lincoln_raw <- gutenberg_download(2657)

Using mirror https://aleph.pglaf.org.

tidy_lincoln <- lincoln_raw %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text)

# Bing on Lincoln
tidy_lincoln %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(index, sentiment)) +
  geom_col(fill = "steelblue") +
  labs(title = "Lincoln's Speeches Sentiment (Bing)")

Joining with `by = join_by(word)`
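
The `linenumber %/% 80` step in the chunks above uses integer division to group lines into blocks of roughly 80 lines, and the net score per block is the positive count minus the negative count. A minimal sketch with made-up matches (`toy_matches` is hypothetical) shows the arithmetic on a small scale:

```r
library(dplyr)
library(tidyr)

# Hypothetical matched words with the line each one came from
toy_matches <- tibble(
  linenumber = c(3, 40, 79, 80, 120, 199),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive")
)

net_by_chunk <- toy_matches %>%
  # Integer division: lines 1-79 -> 0, 80-159 -> 1, 160-239 -> 2
  count(index = linenumber %/% 80, sentiment) %>%
  # One column per sentiment, filling missing combinations with 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Net sentiment per chunk
  mutate(net = positive - negative)
```

Because `row_number()` starts at 1, the first block holds lines 1 to 79 and every later block holds 80 lines.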

# Syuzhet on Lincoln (additional lexicon)
tidy_lincoln %>%
  mutate(sentiment_score = get_sentiment(word, method = "syuzhet")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(sentiment_score)) %>%
  ggplot(aes(index, sentiment)) +
  geom_col(fill = "forestgreen") +
  labs(title = "Lincoln's Speeches Sentiment (Syuzhet)")

Conclusion

We reproduced the sentiment analysis from Chapter 2, which covers Jane Austen's novels, and applied the same techniques to Abraham Lincoln's speeches using a different lexicon from the syuzhet package.

The main takeaway is the difference in sentiment between the two corpora. Austen's novels trend more positive, while Lincoln's speeches are heavier and more negative, reflecting the nature of the material. The Bing and Syuzhet lexicons captured similar relative sentiment trajectories for the Lincoln text.
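
One possible way to check the similarity of the trajectories numerically, rather than by eye, is to join the per-chunk scores from two lexicons on `index` and compute their correlation. The sketch below uses invented numbers purely for illustration (`bing_chunks` and `syuzhet_chunks` are hypothetical and are not results from this analysis); in practice the two inputs would be the summarised tables that feed the plots above.

```r
library(dplyr)

# Hypothetical per-chunk scores, invented for illustration only
bing_chunks    <- tibble(index = 0:4, bing    = c(-3, 2, -5, 1, -2))
syuzhet_chunks <- tibble(index = 0:4, syuzhet = c(-1.5, 0.8, -2.6, 1.1, -0.4))

# Align the two series chunk by chunk
both <- inner_join(bing_chunks, syuzhet_chunks, by = "index")

# Pearson correlation: values near 1 mean the series rise and fall together
trajectory_cor <- cor(both$bing, both$syuzhet)
```

Correlation is used here because the two lexicons score on different scales, so only the shape of the trajectory is comparable, not the raw values.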