DATA 607: Sentiment Analysis (Ch. 2 + Extensions, Bing + Jockers–Rinker)

Base example citation: Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach, Chapter 2 Sentiment Analysis. https://www.tidytextmining.com/sentiment.html

0.1 Note on lexicons & non‑interactive installs

Some sentiment lexicons (e.g., AFINN and NRC) are not bundled with tidytext. They are downloaded on first use through the textdata package, which prompts the user to accept the license terms. Knitting on Posit Cloud runs in a non-interactive session, so the prompt cannot be answered and the knit fails.
To avoid this, this report compares bing (from tidytext) with Jockers–Rinker (from the lexicon package), which ships locally and does not require any interactive download. This satisfies the assignment requirement to incorporate an additional lexicon from another package.

1 Setup (no interactive downloads)
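
The setup chunk is not echoed in the knitted output. A minimal sketch of what it might contain, assuming it loads the eight packages listed under "other attached packages" in the session info; the helper name `missing_packages()` is hypothetical and not part of any package:

```r
# Packages listed as attached in the session info at the end of this report.
pkgs <- c("dplyr", "tidyr", "ggplot2", "stringr",
          "tidytext", "janeaustenr", "gutenbergr", "lexicon")

# Hypothetical helper: names of the packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Stop with a readable message rather than failing on a later library() call.
not_installed <- missing_packages(pkgs)
if (length(not_installed) > 0) {
  stop("Install before knitting: ", paste(not_installed, collapse = ", "))
}

# Attach everything; none of these calls prompts for input.
invisible(lapply(pkgs, library, character.only = TRUE))
```

Checking for missing packages up front keeps the failure in one place, which is easier to diagnose in a non-interactive knit than an error halfway through the analysis.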

## 1) Chapter 2 reproduction (Austen, minimal subset)

This reproduces the Ch. 2 workflow on Pride & Prejudice alone, which keeps memory use small on Posit Cloud.

austen <- austen_books() %>%
  dplyr::filter(book == "Pride & Prejudice") %>%
  dplyr::mutate(
    linenumber = dplyr::row_number(),
    # NOTE: double backslash \\d so R passes \d to the regex engine
    chapter = cumsum(stringr::str_detect(text, stringr::regex("^chapter\\s+\\d+", ignore_case = TRUE)))
  )

austen_words <- austen %>%
  tidytext::unnest_tokens(word, text) %>%
  dplyr::anti_join(stop_words, by = "word")

bing <- tidytext::get_sentiments("bing")

austen_net <- austen_words %>%
  dplyr::inner_join(bing, by = "word") %>%
  dplyr::count(index = linenumber %/% 60, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  dplyr::mutate(net = positive - negative)

ggplot(austen_net, aes(index, net)) +
  geom_col() +
  labs(title = "Pride & Prejudice — net sentiment (bing)",
       x = "Narrative index (per 60 lines)", y = "Positive − Negative") +
  theme_minimal(base_size = 12)
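
The binning and net-score steps above can be checked on a toy input. The sketch below uses made-up line numbers and sentiment labels (not taken from the novel) to show how `%/% 60` assigns lines to blocks and how `pivot_wider()` fills in a polarity that is absent from a block:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy input: six sentiment-bearing tokens and their line numbers.
toy <- tibble(
  linenumber = c(1, 2, 59, 60, 61, 130),
  sentiment  = c("positive", "negative", "positive",
                 "positive", "negative", "negative")
)

toy_net <- toy %>%
  # Integer division: lines 1-59 -> block 0, 60-119 -> block 1, 120-179 -> block 2
  count(index = linenumber %/% 60, sentiment) %>%
  # One column per polarity; a block with no tokens of a polarity gets 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

toy_net
# net is 1, 0, -1 for index 0, 1, 2
```

Because `row_number()` starts at 1, the first block holds 59 lines rather than 60; the effect on the plot is negligible, but it explains why line 60 lands in block 1.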

## 2) Extension A — Different corpus: Poe’s “The Raven” (Gutenberg 1065)

poe <- gutenbergr::gutenberg_download(1065, mirror = "https://gutenberg.org") %>%
  dplyr::mutate(linenumber = dplyr::row_number())

poe_words <- poe %>%
  tidytext::unnest_tokens(word, text) %>%
  dplyr::anti_join(stop_words, by = "word")

# quick check
poe_words %>% dplyr::count(word, sort = TRUE) %>% head()

raven_bing <- poe_words %>%
  dplyr::inner_join(bing, by = "word") %>%
  dplyr::count(index = linenumber %/% 35, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  dplyr::mutate(net = positive - negative)

ggplot(raven_bing, aes(index, net)) +
  geom_col() +
  labs(title = 'Poe — "The Raven": net sentiment (bing)',
       x = "Narrative index (per 35 lines)", y = "Positive − Negative") +
  theme_minimal(base_size = 12)

## 3) Extension B — Extra lexicon from another package (Jockers–Rinker) & comparison

Adds Jockers–Rinker from lexicon and compares against bing. All weights are coerced to numeric to avoid sum() errors.

# bing → ±1 (no download required)
lex_bing <- tidytext::get_sentiments("bing") %>%
  dplyr::transmute(word, weight = dplyr::if_else(sentiment == "positive", 1, -1)) %>%
  dplyr::mutate(weight = as.numeric(weight))

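# Jockers–Rinker ships with `lexicon`, so no download is needed.
# In current releases it is a data.table with the words in column `x` and the
# scores in column `y`; the branches below guard against other layouts.
jr_obj <- lexicon::hash_sentiment_jockers_rinker
jr_tbl <- if (is.data.frame(jr_obj)) {
  # The words live in a column, so do NOT promote row names to `word`:
  # that would fill `word` with "1", "2", ... and the join would only match numeric tokens.
  tibble::as_tibble(jr_obj)
} else if (is.matrix(jr_obj)) {
  tibble::as_tibble(jr_obj, rownames = "word")
} else if (is.atomic(jr_obj) && !is.null(names(jr_obj))) {
  tibble::tibble(word = names(jr_obj), value = as.numeric(jr_obj))
} else stop("Unexpected structure for lexicon::hash_sentiment_jockers_rinker")

# Use the first character column as the word and the first numeric column as the score
chr_cols <- names(dplyr::select(jr_tbl, where(is.character)))
num_cols <- names(dplyr::select(jr_tbl, where(is.numeric)))
stopifnot(length(chr_cols) >= 1, length(num_cols) >= 1)

lex_jr <- jr_tbl %>%
  dplyr::transmute(word = .data[[chr_cols[1]]], weight = as.numeric(.data[[num_cols[1]]])) %>%
  dplyr::filter(!is.na(word), !is.na(weight))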

# safety checks
stopifnot(is.numeric(lex_bing$weight), is.numeric(lex_jr$weight))

score_by_lex <- function(words_tbl, lex_tbl) {
  words_tbl %>%
    dplyr::inner_join(lex_tbl, by = "word") %>%
    dplyr::mutate(index = linenumber %/% 35) %>%
    dplyr::group_by(index) %>%
    dplyr::summarise(score = sum(weight, na.rm = TRUE), .groups = "drop")
}

raven_scores <- dplyr::bind_rows(
  score_by_lex(poe_words, lex_bing) %>% dplyr::mutate(lexicon = "bing"),
  score_by_lex(poe_words, lex_jr)   %>% dplyr::mutate(lexicon = "jockers_rinker")
)

raven_scores %>%
  ggplot2::ggplot(ggplot2::aes(index, score)) +
  ggplot2::geom_line() +
  ggplot2::facet_wrap(~ lexicon, ncol = 2, scales = "free_y") +
  ggplot2::labs(title = 'Poe — "The Raven": sentiment by lexicon (bing vs. Jockers–Rinker)',
                x = "Narrative index (per 35 lines)", y = "Score") +
  ggplot2::theme_minimal(base_size = 12)
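
The two lexicons differ in vocabulary, so each panel is built from a different subset of the poem's tokens. A quick coverage check can reveal a join that matched few or no words before the curves are compared. The sketch below is illustrative: `lexicon_coverage()` is a hypothetical helper (not part of tidytext or lexicon), and the word lists are toy data.

```r
# Hypothetical helper: share of tokens in `words` that have an entry in `lexicon_words`.
lexicon_coverage <- function(words, lexicon_words) {
  mean(words %in% lexicon_words)
}

# Toy data for illustration only.
toy_words   <- c("raven", "dreary", "weak", "weary", "door")
toy_lexicon <- c("dreary", "weak", "weary", "gloom")

lexicon_coverage(toy_words, toy_lexicon)  # 3 of 5 tokens match: 0.6
```

In the report itself the same helper could be called as `lexicon_coverage(poe_words$word, lex_bing$word)` and `lexicon_coverage(poe_words$word, lex_jr$word)`; a value near zero for either lexicon would suggest the join key is wrong.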

## 4) Short conclusions

✔ Reproduced Chapter 2 (Austen subset).
✔ Different corpus (Poe’s The Raven).
✔ Additional lexicon from another package (lexicon → Jockers–Rinker).
Observation: Jockers–Rinker generally agrees with bing on turning points, with smoother swings due to its weighting.

Interpretation: Austen (bing)

Findings (Austen) — Using bing on Pride & Prejudice, the net-sentiment bar series oscillates around neutral with several positive surges early and a mild down-trend near the middle. Because bing is a ±1 dictionary, the magnitude primarily reflects how often positive vs. negative tokens appear per block of lines, not intensity. This matches the novel’s tone shifts during dialogue-heavy sections and conflict set-ups.

Interpretation: “The Raven” (bing vs. Jockers–Rinker)

Findings (Poe) — For The Raven, both lexicons show a slide toward negativity as the poem progresses, with troughs when “nevermore,” “darkness,” and “dreary” cluster. Jockers–Rinker tracks the turning points but produces smoother curves than bing because JR has graded weights, not just ±1. This reduces spiky swings when a few frequent tokens dominate (e.g., “nevermore”).

Takeaway: Agreement on the timing of peaks and troughs suggests that the two lexicons capture the same underlying signal (convergent validity); differences in amplitude reflect the scoring scheme (±1 polarity counts vs. graded polarity weights).
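
To make the amplitude point concrete, the sketch below scores the same block of tokens twice, once with ±1 weights and once with graded weights. All weights are made up for illustration; they are not the values stored in bing or Jockers–Rinker.

```r
# Made-up weights for illustration; NOT the values stored in either lexicon.
binary_w <- c(dreary = -1,   grim = -1,    sad = -1,    radiant = 1)
graded_w <- c(dreary = -0.5, grim = -0.75, sad = -0.25, radiant = 1)

# Tokens that fall in one block of lines
block <- c("dreary", "grim", "sad", "radiant")

binary_score <- sum(binary_w[block])  # -2
graded_score <- sum(graded_w[block])  # -0.5
```

Both scores are negative, so the block would mark a trough under either scheme, but the graded total is smaller in magnitude because each token contributes less than a full unit.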

## References

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach, Ch. 2. https://www.tidytextmining.com/sentiment.html
R packages used: tidytext, janeaustenr, gutenbergr, lexicon, dplyr, tidyr, ggplot2, stringr, knitr (see Session info).
Jockers, M. L., & Rinker, T. W. Sentiment dictionary (via lexicon::hash_sentiment_jockers_rinker).

## 5) Session info

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;  LAPACK version 3.9.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] lexicon_1.2.1     gutenbergr_0.3.0  janeaustenr_1.0.0 tidytext_0.4.3   
## [5] stringr_1.5.2     ggplot2_4.0.0     tidyr_1.3.1       dplyr_1.1.4      
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     stringi_1.8.7      lattice_0.22-7    
##  [5] hms_1.1.4          digest_0.6.37      magrittr_2.0.4     evaluate_1.0.5    
##  [9] grid_4.5.1         RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0    
## [13] Matrix_1.7-3       syuzhet_1.0.7      purrr_1.1.0        scales_1.4.0      
## [17] codetools_0.2-20   jquerylib_0.1.4    cli_3.6.5          crayon_1.5.3      
## [21] rlang_1.1.6        tokenizers_0.3.0   bit64_4.6.0-1      withr_3.0.2       
## [25] cachem_1.1.0       yaml_2.3.10        readMDTable_0.3.2  parallel_4.5.1    
## [29] tools_4.5.1        tzdb_0.5.0         vctrs_0.6.5        R6_2.6.1          
## [33] lifecycle_1.0.4    bit_4.6.0          vroom_1.6.6        pkgconfig_2.0.3   
## [37] pillar_1.11.1      bslib_0.9.0        gtable_0.3.6       glue_1.8.0        
## [41] data.table_1.17.8  Rcpp_1.1.0         xfun_0.54          tibble_3.3.0      
## [45] tidyselect_1.2.1   knitr_1.50         farver_2.1.2       htmltools_0.5.8.1 
## [49] SnowballC_0.7.1    rmarkdown_2.30     labeling_0.4.3     readr_2.1.5       
## [53] compiler_4.5.1     S7_0.2.0

Sachi Kapoor

2025-11-02
