Base example citation: Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach, Chapter 2 Sentiment Analysis. https://www.tidytextmining.com/sentiment.html
On first use, some sentiment lexicons (e.g., AFINN and NRC) are downloaded through the textdata package, which prompts the user to accept a license/terms. On Posit Cloud, knitting runs in a non-interactive session, so those prompts cannot be answered and the knit fails.
To avoid this, this report compares bing (from tidytext) with Jockers–Rinker (from the lexicon package), which ships with the package and does not require any interactive download. This satisfies the assignment requirement to incorporate an additional lexicon from another package.
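The choice described above can be sketched as a small guard. This is only an illustration: `choose_lexicon` is a hypothetical helper, not a function from tidytext or textdata, and the report itself does not use it.

```r
# Hypothetical helper (illustration only, not part of any package):
# return the name of a lexicon that is safe to load in the current session.
choose_lexicon <- function(is_interactive = interactive()) {
  if (is_interactive) {
    # A person is present to answer the textdata license prompt on first use.
    "afinn"
  } else {
    # Knitting cannot answer prompts, so stay with a lexicon bundled in tidytext.
    "bing"
  }
}

# Simulate a non-interactive knit
lexicon_name <- choose_lexicon(is_interactive = FALSE)
```

The returned name could then be passed to `tidytext::get_sentiments()`; this report instead uses bing and Jockers–Rinker unconditionally, which avoids the question altogether.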
This section reproduces the Ch. 2 workflow on Pride & Prejudice only, to keep memory use small on Posit Cloud.
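# Packages used unqualified below (all appear as attached in the session info)
library(dplyr)        # %>%
library(tidytext)     # stop_words
library(janeaustenr)  # austen_books()
library(ggplot2)      # ggplot(), aes(), geom_col(), labs(), theme_minimal()

austen <- austen_books() %>%
  dplyr::filter(book == "Pride & Prejudice") %>%
  dplyr::mutate(
    linenumber = dplyr::row_number(),
    # NOTE: double backslash \\d so R passes \d to the regex engine
    chapter = cumsum(stringr::str_detect(text, stringr::regex("^chapter\\s+\\d+", ignore_case = TRUE)))
  )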
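To see how the chapter counter works, here is a minimal base-R sketch on a few made-up lines (the `text` vector is invented for illustration, and `grepl()` stands in for `stringr::str_detect()`): each heading match contributes 1 to a running total, so every line inherits the number of the most recent heading.

```r
# Made-up lines standing in for the novel's text column
text <- c("CHAPTER 1", "It is a truth universally acknowledged", "",
          "Chapter 2", "Mr. Bennet was among the earliest")

# TRUE where a line starts with "chapter", whitespace, then a digit
is_heading <- grepl("^chapter\\s+\\d+", text, ignore.case = TRUE, perl = TRUE)

# Running count of headings seen so far = chapter number for each line
chapter <- cumsum(is_heading)
chapter
# → 1 1 1 2 2
```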
austen_words <- austen %>%
  tidytext::unnest_tokens(word, text) %>%
  dplyr::anti_join(stop_words, by = "word")

bing <- tidytext::get_sentiments("bing")

austen_net <- austen_words %>%
  dplyr::inner_join(bing, by = "word") %>%
  dplyr::count(index = linenumber %/% 60, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  dplyr::mutate(net = positive - negative)

ggplot(austen_net, aes(index, net)) +
  geom_col() +
  labs(title = "Pride & Prejudice — net sentiment (bing)",
       x = "Narrative index (per 60 lines)", y = "Positive − Negative") +
  theme_minimal(base_size = 12)
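The block index and net score used above can be checked by hand on a toy example. The sketch below uses base R and invented data (`toy` is not from the novel): integer division by 60 assigns each line to a block, and the net score is the positive count minus the negative count within each block.

```r
# Invented matches: the line each sentiment word came from, and its polarity
toy <- data.frame(
  linenumber = c(3, 40, 59, 60, 75, 130),
  sentiment  = c("positive", "negative", "positive",
                 "negative", "negative", "positive")
)

# Lines 0-59 fall in block 0, 60-119 in block 1, 120-179 in block 2
toy$index <- toy$linenumber %/% 60

# Count words per block and polarity, then take positive minus negative
counts <- table(toy$index, toy$sentiment)
net <- counts[, "positive"] - counts[, "negative"]
net
# block 0 → 1, block 1 → -2, block 2 → 1
```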
## 2) Extension A: Different corpus (Poe’s “The Raven”, Gutenberg 1065)
poe <- gutenbergr::gutenberg_download(1065, mirror = "https://gutenberg.org") %>%
  dplyr::mutate(linenumber = dplyr::row_number())

poe_words <- poe %>%
  tidytext::unnest_tokens(word, text) %>%
  dplyr::anti_join(stop_words, by = "word")

# quick check
poe_words %>% dplyr::count(word, sort = TRUE) %>% head()

raven_bing <- poe_words %>%
  dplyr::inner_join(bing, by = "word") %>%
  dplyr::count(index = linenumber %/% 35, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  dplyr::mutate(net = positive - negative)

ggplot(raven_bing, aes(index, net)) +
  geom_col() +
  labs(title = 'Poe — "The Raven": net sentiment (bing)',
       x = "Narrative index (per 35 lines)", y = "Positive − Negative") +
  theme_minimal(base_size = 12)
## 3) Extension B: Extra lexicon from another package (Jockers–Rinker) & comparison
This extension adds the Jockers–Rinker lexicon from the lexicon package and compares it against bing. All weights are coerced to numeric so that sum() does not fail on non-numeric input.
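As a small base-R illustration of why the coercion matters (the values below are invented): `sum()` refuses character input, while the same values sum normally after `as.numeric()`.

```r
# Invented weights that arrived as text, e.g. after a careless conversion
weights_chr <- c("1", "-1", "0.5")

# sum() on a character vector raises an error; capture it instead of stopping
outcome <- tryCatch(sum(weights_chr), error = function(e) "sum() failed")

# After coercion the same values can be summed
weights_num <- as.numeric(weights_chr)
total <- sum(weights_num)
total
# → 0.5
```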
# bing → ±1 (no download required)
lex_bing <- tidytext::get_sentiments("bing") %>%
  dplyr::transmute(word, weight = dplyr::if_else(sentiment == "positive", 1, -1)) %>%
  dplyr::mutate(weight = as.numeric(weight))

# Jockers–Rinker ships with `lexicon`; coerce defensively across package versions
jr_obj <- lexicon::hash_sentiment_jockers_rinker
jr_tbl <- if (is.data.frame(jr_obj)) {
  # data frame / data.table: the words live in a column, not in the row names
  tibble::as_tibble(jr_obj)
} else if (is.matrix(jr_obj)) {
  # matrix: the words are expected in the row names
  tibble::as_tibble(jr_obj, rownames = "word")
} else if (is.atomic(jr_obj) && !is.null(names(jr_obj))) {
  tibble::tibble(word = names(jr_obj), value = as.numeric(jr_obj))
} else stop("Unexpected structure for lexicon::hash_sentiment_jockers_rinker")

# Use the first character column as the word and the first numeric column as the score
chr_cols <- names(dplyr::select(jr_tbl, where(is.character)))
num_cols <- names(dplyr::select(jr_tbl, where(is.numeric)))
stopifnot(length(chr_cols) >= 1, length(num_cols) >= 1)

lex_jr <- jr_tbl %>%
  dplyr::transmute(word = .data[[chr_cols[1]]], weight = as.numeric(.data[[num_cols[1]]])) %>%
  dplyr::filter(!is.na(word), !is.na(weight))

# safety checks
stopifnot(is.numeric(lex_bing$weight), is.numeric(lex_jr$weight))

score_by_lex <- function(words_tbl, lex_tbl) {
  words_tbl %>%
    dplyr::inner_join(lex_tbl, by = "word") %>%
    dplyr::mutate(index = linenumber %/% 35) %>%
    dplyr::group_by(index) %>%
    dplyr::summarise(score = sum(weight, na.rm = TRUE), .groups = "drop")
}

raven_scores <- dplyr::bind_rows(
  score_by_lex(poe_words, lex_bing) %>% dplyr::mutate(lexicon = "bing"),
  score_by_lex(poe_words, lex_jr) %>% dplyr::mutate(lexicon = "jockers_rinker")
)

raven_scores %>%
  ggplot2::ggplot(ggplot2::aes(index, score)) +
  ggplot2::geom_line() +
  ggplot2::facet_wrap(~ lexicon, ncol = 2, scales = "free_y") +
  ggplot2::labs(title = 'Poe — "The Raven": sentiment by lexicon (bing vs. Jockers–Rinker)',
                x = "Narrative index (per 35 lines)", y = "Score") +
  ggplot2::theme_minimal(base_size = 12)
## 4) Short conclusions
lexicon → Jockers–Rinker).
Findings (Austen): Using bing on Pride & Prejudice, the net-sentiment bars oscillate around neutral, with several positive surges early and a mild downward trend near the middle. Because bing is a ±1 dictionary, the magnitude primarily reflects how often positive versus negative tokens appear per 60-line block, not intensity. This is consistent with the novel’s tone shifts during dialogue-heavy sections and conflict set-ups.
Findings (Poe): For “The Raven”, both lexicons show a slide toward negativity as the poem progresses, with troughs where “nevermore,” “darkness,” and “dreary” cluster. Jockers–Rinker tracks the same turning points but produces smoother curves than bing because its weights are graded rather than just ±1. This dampens spiky swings when a few frequent tokens dominate (e.g., “nevermore”).
Takeaway: Agreement on the timing of peaks and troughs suggests construct validity across lexicons; differences in amplitude reflect the weighting schemes (counting polarity vs. weighted polarity).
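The difference between counting polarity and weighted polarity can be seen in a few lines of base R. The weights below are hypothetical values chosen for illustration; they are not taken from bing or from Jockers–Rinker.

```r
# Tokens matched in one block of text (invented)
tokens <- c("dreary", "dreary", "sad", "hope")

# Counting polarity: every match contributes exactly +1 or -1
binary <- c(dreary = -1, sad = -1, hope = 1)

# Weighted polarity: hypothetical graded weights in [-1, 1]
graded <- c(dreary = -0.5, sad = -0.75, hope = 0.5)

score_binary <- sum(binary[tokens])
score_graded <- sum(graded[tokens])
c(binary = score_binary, graded = score_graded)
# → binary = -2, graded = -1.25
```

Both scores have the same sign, so the direction of the block is unchanged, but the graded score is smaller in magnitude because a repeated mild word such as "dreary" counts for less than a full unit each time.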
Packages: tidytext, janeaustenr, gutenbergr, lexicon, dplyr, tidyr, ggplot2, stringr, knitr (see Session info).
Lexicon: Jockers–Rinker (lexicon::hash_sentiment_jockers_rinker).
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lexicon_1.2.1 gutenbergr_0.3.0 janeaustenr_1.0.0 tidytext_0.4.3
## [5] stringr_1.5.2 ggplot2_4.0.0 tidyr_1.3.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 stringi_1.8.7 lattice_0.22-7
## [5] hms_1.1.4 digest_0.6.37 magrittr_2.0.4 evaluate_1.0.5
## [9] grid_4.5.1 RColorBrewer_1.1-3 fastmap_1.2.0 jsonlite_2.0.0
## [13] Matrix_1.7-3 syuzhet_1.0.7 purrr_1.1.0 scales_1.4.0
## [17] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.5 crayon_1.5.3
## [21] rlang_1.1.6 tokenizers_0.3.0 bit64_4.6.0-1 withr_3.0.2
## [25] cachem_1.1.0 yaml_2.3.10 readMDTable_0.3.2 parallel_4.5.1
## [29] tools_4.5.1 tzdb_0.5.0 vctrs_0.6.5 R6_2.6.1
## [33] lifecycle_1.0.4 bit_4.6.0 vroom_1.6.6 pkgconfig_2.0.3
## [37] pillar_1.11.1 bslib_0.9.0 gtable_0.3.6 glue_1.8.0
## [41] data.table_1.17.8 Rcpp_1.1.0 xfun_0.54 tibble_3.3.0
## [45] tidyselect_1.2.1 knitr_1.50 farver_2.1.2 htmltools_0.5.8.1
## [49] SnowballC_0.7.1 rmarkdown_2.30 labeling_0.4.3 readr_2.1.5
## [53] compiler_4.5.1 S7_0.2.0