This document demonstrates sentiment analysis using tidy data principles. The primary example code is based on Chapter 2 of “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson (2025).
First, we load the necessary libraries and get the sentiment lexicons.
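The setup chunk itself is not reproduced in this section. A minimal sketch of what it is assumed to contain, inferred from the functions used below (the exact package list is an assumption):

library(dplyr)        # tibbles, joins, counts
library(tidyr)        # pivot_wider() for reshaping counts
library(stringr)      # string/regex helpers
library(tidytext)     # unnest_tokens(), get_sentiments()
library(janeaustenr)  # austen_books() corpus
library(ggplot2)      # plotting
# The bing lexicon labels individual words as positive or negative
bing <- get_sentiments("bing")
head(bing)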
Here we work with Jane Austen’s novels and perform sentiment analysis using the bing lexicon.
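The Austen code is not reproduced in this section; the sketch below follows the Chapter 2 workflow from Silge and Robinson (the chapter-detection regex and the 80-line index size come from that chapter and may differ from the chunk actually used here):

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
# Net sentiment (positive minus negative bing words) per 80-line block of each novel
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)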
Next, we turn to a different corpus: Gödel, Escher, Bach, whose text was extracted separately from a PDF.
# The text was extracted from the PDF beforehand; once extracted, it can be processed like plain text.
godel_escher_bach <- readLines("GEB_extracted.txt")
# Check for and remove any invalid UTF-8 characters
godel_escher_bach <- iconv(godel_escher_bach, from = "UTF-8", to = "UTF-8", sub = "")
# Create a tibble with one row per line of text
godel_escher_bach_df <- tibble(line = 1:length(godel_escher_bach), text = godel_escher_bach)
# Tokenize into one word per row
tidy_godel_escher_bach <- godel_escher_bach_df %>%
  unnest_tokens(word, text)
# Print a sample of the processed data
print(head(tidy_godel_escher_bach))

## # A tibble: 6 × 2
##    line word
##   <int> <chr>
## 1     8 contents
## 2    11 overview
## 3    11 viii
## 4    12 list
## 5    12 of
## 6    12 illustrations
geb_sentiment <- tidy_godel_escher_bach %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)

## Joining with `by = join_by(word)`

geb_sentiment
## # A tibble: 1,955 × 3
##    word         sentiment     n
##    <chr>        <chr>     <int>
##  1 like         positive    499
##  2 well         positive    499
##  3 right        positive    247
##  4 problem      negative    224
##  5 intelligence positive    199
##  6 strange      negative    163
##  7 complex      negative    144
##  8 free         positive    139
##  9 good         positive    138
## 10 work         positive    127
## # ℹ 1,945 more rows
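As a possible follow-up, not part of the original output, the word-level counts in geb_sentiment can be visualized the same way Chapter 2 does for the Austen novels, showing the words that contribute most to each sentiment:

geb_sentiment %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment", y = NULL)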
We now repeat the analysis at the sentence level with the sentimentr package, which scores whole sentences and accounts for valence shifters such as negators, rather than scoring individual words.
library(sentimentr)
# Load and clean the text data
godel_escher_bach <- readLines("GEB_extracted.txt", encoding = "UTF-8")
godel_escher_bach <- iconv(godel_escher_bach, from = "UTF-8", to = "UTF-8", sub = "")
godel_escher_bach <- trimws(godel_escher_bach)
godel_escher_bach <- godel_escher_bach[godel_escher_bach != ""]
# Create a tibble from the cleaned text
godel_escher_bach_df <- tibble(line = 1:length(godel_escher_bach), text = godel_escher_bach)
# Segment the text into sentences
sentence_list <- get_sentences(godel_escher_bach_df$text)
# Flatten the list of sentences into a single character vector, one sentence per element
flattened_sentences <- unlist(sentence_list, recursive = FALSE)
# Perform sentiment analysis
sentences_data <- sentiment_by(flattened_sentences)

## Warning: Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
## raw `character` vector is passed to `text.var`. This may be costly of time and
## memory. It is highly recommended that the user first runs the raw `character`
## vector through the `get_sentences` function.
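As the warning suggests, sentiment_by() prefers input that has already been run through get_sentences(). A sketch of that pattern (an aside, not used for the results below; note that element_id then refers to the original line of text, with sentence_id indexing sentences within it, unlike the flattened one-sentence-per-element approach used here):

presplit <- get_sentences(godel_escher_bach_df$text)
line_level_sentiment <- sentiment_by(presplit)  # avoids the boundary-disambiguation warning
head(line_level_sentiment)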
# Sample 10 random sentence IDs (a seed could be set beforehand for reproducibility)
sample_ids <- sample(sentences_data$element_id, 10)
# Extract the sentiment rows for the sampled element IDs
sample_sentences <- sentences_data[sentences_data$element_id %in% sample_ids, ]
# Attach the sentence text, indexing by element_id so text and scores stay aligned
samples_with_text <- cbind(sample_sentences, text = flattened_sentences[sample_sentences$element_id])
# Print each sampled sentence individually, with its element ID
cat("## Sample Sentences from Gödel, Escher, Bach\n\n")

## ## Sample Sentences from Gödel, Escher, Bach

for (i in 1:nrow(samples_with_text)) {
  cat(sprintf("**Sentence %d (ID: %d):**\n", i, samples_with_text$element_id[i]))
  cat(samples_with_text$text[i], "\n\n")
}

## **Sentence 1 (ID: 1262):**
## (5) If two lines are drawn which intersect a third in such a way that the sum of the inner angles on one side is less than two right angles, then the two lines inevitably must intersect each other on that side if extended far enough
##
## **Sentence 2 (ID: 2518):**
## Meyer, can.
##
## **Sentence 3 (ID: 4323):**
## Repression is trickier.
##
## **Sentence 4 (ID: 4858):**
## In my own hypothetical brain model, conscious awareness does get representation as a very real causal agent and rates an important place in the causal sequence and chain of control in brain events, in which it appears as an active, operational force....
##
## **Sentence 5 (ID: 9402):**
## But for the moment, suppose that issue had been solved.
##
## **Sentence 6 (ID: 14121):**
## "And your present position is that you accept A and B, but you DON'T accept the Hypothetical-"
##
## **Sentence 7 (ID: 14462):**
## Achilles: Oh, please do!
##
## **Sentence 8 (ID: 18024):**
## The standard scientific explanation for this is that ESP is a nonreal phenomenon which cannot stand up to rigorous scrutiny.
##
## **Sentence 9 (ID: 18371):**
## In fact, the pulling-out may inv such complicated operations that it makes you feel you are putting in n information than you are pulling out.
##
## **Sentence 10 (ID: 20062):**
## Words and Symbols

## Sentiment Analysis Statistics Table

knitr::kable(samples_with_text[, c("element_id", "word_count", "sd", "ave_sentiment")],
             col.names = c("Element ID", "Word Count", "SD", "Average Sentiment"),
             caption = "Sentiment Statistics for Sample Sentences")

| Element ID | Word Count | SD | Average Sentiment |
|---|---|---|---|
| 1262 | 17 | NA | -0.2425356 |
| 2518 | 45 | NA | 0.0074536 |
| 4323 | 26 | NA | -0.0196116 |
| 4858 | 3 | NA | 0.0000000 |
| 9402 | 10 | NA | -0.0632456 |
| 14121 | 3 | NA | -0.2886751 |
| 14462 | 4 | NA | 0.5000000 |
| 18024 | 20 | NA | 0.2012461 |
| 18371 | 42 | NA | 0.3086067 |
| 20062 | 2 | NA | 0.0000000 |
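The SD column is NA because each sampled element is a single sentence, so there is no within-element spread to report. A corpus-level summary, sketched here as an assumption rather than part of the original analysis, could average sentence scores over the whole book:

whole_book_scores <- sentiment(get_sentences(godel_escher_bach_df$text))
mean(whole_book_scores$sentiment, na.rm = TRUE)   # overall average sentence polarity
summary(whole_book_scores$sentiment)              # distribution of sentence-level scores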
This assignment extended the analysis by applying sentiment analysis to a different corpus, Gödel, Escher, Bach, and by incorporating an additional sentiment lexicon, the polarity lexicon used by sentimentr, beyond those discussed in Chapter 2 of the source text.
Hofstadter, D. R. (1979). Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books.
Silge, J., & Robinson, D. (2025). Text Mining with R: A Tidy Approach.