Introduction

This document demonstrates sentiment analysis using tidy data principles. The primary example code is based on Chapter 2 of “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson.

Setup

First, we load the necessary libraries and get the sentiment lexicons.
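
A minimal setup sketch, assuming the tidytext, janeaustenr, dplyr, tidyr, stringr, and ggplot2 packages are installed (the AFINN and NRC lexicons additionally prompt for a one-time download through the textdata package):

library(dplyr)        # data manipulation
library(tidyr)        # reshaping (pivot_wider)
library(stringr)      # regex helpers for chapter detection
library(ggplot2)      # plotting
library(tidytext)     # unnest_tokens() and get_sentiments()
library(janeaustenr)  # Jane Austen's novels as a data frame

# Preview the lexicons; "bing" ships with tidytext, while "afinn" and
# "nrc" are fetched via the textdata package the first time they are used.
get_sentiments("bing")
get_sentiments("afinn")
get_sentiments("nrc")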

Example Code

Here we work with Jane Austen’s novels and perform sentiment analysis using the bing lexicon.
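
A sketch of that analysis, closely following the Chapter 2 example (the 80-line index size is the chunk width used there):

# Tidy the novels, tracking line and chapter numbers
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Net sentiment (positive - negative) per 80-line chunk of each novel
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

# Plot the sentiment trajectory of each novel
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")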

Extending the Example

Different Corpus

Next, we work with a different corpus: Gödel, Escher, Bach by Douglas Hofstadter, with the text extracted separately from a PDF.

# The text is assumed to have been extracted from the PDF already and saved as plain text
godel_escher_bach <- readLines("GEB_extracted.txt")

# Check and remove any invalid UTF-8 characters
godel_escher_bach <- iconv(godel_escher_bach, from = "UTF-8", to = "UTF-8", sub = "")

# Create a tibble
godel_escher_bach_df <- tibble(line = 1:length(godel_escher_bach), text = godel_escher_bach)

# Unnest tokens
tidy_godel_escher_bach <- godel_escher_bach_df %>%
  unnest_tokens(word, text)

# Print a sample of the processed data
print(head(tidy_godel_escher_bach))
## # A tibble: 6 × 2
##    line word         
##   <int> <chr>        
## 1     8 contents     
## 2    11 overview     
## 3    11 viii         
## 4    12 list         
## 5    12 of           
## 6    12 illustrations
geb_sentiment <- tidy_godel_escher_bach %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
## Joining with `by = join_by(word)`
geb_sentiment
## # A tibble: 1,955 × 3
##    word         sentiment     n
##    <chr>        <chr>     <int>
##  1 like         positive    499
##  2 well         positive    499
##  3 right        positive    247
##  4 problem      negative    224
##  5 intelligence positive    199
##  6 strange      negative    163
##  7 complex      negative    144
##  8 free         positive    139
##  9 good         positive    138
## 10 work         positive    127
## # ℹ 1,945 more rows
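
As in Chapter 2, these counts can be turned into a quick plot of the words that contribute most to each sentiment. A sketch using ggplot2 (the cutoff of 10 words per sentiment is illustrative):

# Top 10 contributors to positive and negative sentiment in GEB
geb_sentiment %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment", y = NULL)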

Additional Sentiment Lexicon

Next, we incorporate an additional sentiment lexicon through another R package, sentimentr, which scores sentiment at the sentence level and accounts for valence shifters such as negation.

library(sentimentr)

# Load and clean the text data
godel_escher_bach <- readLines("GEB_extracted.txt", encoding = "UTF-8")
godel_escher_bach <- iconv(godel_escher_bach, from = "UTF-8", to = "UTF-8", sub = "")
godel_escher_bach <- trimws(godel_escher_bach)
godel_escher_bach <- godel_escher_bach[godel_escher_bach != ""] 

# Create tibble/Dataframe from cleaned text
godel_escher_bach_df <- tibble(line = 1:length(godel_escher_bach), text = godel_escher_bach)

# Segment the text into sentences
sentence_list <- get_sentences(godel_escher_bach_df$text)

# Flatten the list of sentences into a single character vector (one element per sentence)
flattened_sentences <- unlist(sentence_list, recursive = FALSE)

# Perform sentiment analysis
sentences_data <- sentiment_by(flattened_sentences)
## Warning: Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
## raw `character` vector is passed to `text.var`. This may be costly of time and
## memory.  It is highly recommended that the user first runs the raw `character`
## vector through the `get_sentences` function.
# Sample 10 random sentences
sample_ids <- sample(sentences_data$element_id, 10)

# Extract the rows corresponding to the sampled element_ids
sample_sentences <- sentences_data[sentences_data$element_id %in% sample_ids, ]
# Index the text by each row's element_id so sentences stay aligned with their statistics
samples_with_text <- cbind(sample_sentences, text = flattened_sentences[sample_sentences$element_id])

# Print each sentence individually
cat("## Sample Sentences from Gödel, Escher, Bach\n\n")
## ## Sample Sentences from Gödel, Escher, Bach
for(i in 1:nrow(samples_with_text)) {
  cat(sprintf("**Sentence %d (ID: %d):**\n", i, samples_with_text$element_id[i]))
  cat(samples_with_text$text[i], "\n\n")
}
## **Sentence 1 (ID: 1262):**
## (5) If two lines are drawn which intersect a third in such a way that the sum of the inner angles on one side is less than two right angles, then the two lines inevitably must intersect each other on that side if extended far enough 
## 
## **Sentence 2 (ID: 2518):**
## Meyer, can. 
## 
## **Sentence 3 (ID: 4323):**
## Repression is trickier. 
## 
## **Sentence 4 (ID: 4858):**
## In my own hypothetical brain model, conscious awareness does get representation as a very real causal agent and rates an important place in the causal sequence and chain of control in brain events, in which it appears as an active, operational force.... 
## 
## **Sentence 5 (ID: 9402):**
## But for the moment, suppose that issue had been solved. 
## 
## **Sentence 6 (ID: 14121):**
## "And your present position is that you accept A and B, but you DON'T accept the Hypothetical-" 
## 
## **Sentence 7 (ID: 14462):**
## Achilles: Oh, please do! 
## 
## **Sentence 8 (ID: 18024):**
## The standard scientific explanation for this is that ESP is a nonreal phenomenon which cannot stand up to rigorous scrutiny. 
## 
## **Sentence 9 (ID: 18371):**
## In fact, the pulling-out may inv such complicated operations that it makes you feel you are putting in n information than you are pulling out. 
## 
## **Sentence 10 (ID: 20062):**
## Words and Symbols
# Display the statistics table at the end
cat("## Sentiment Analysis Statistics Table\n\n")
## ## Sentiment Analysis Statistics Table
knitr::kable(samples_with_text[, c("element_id", "word_count", "sd", "ave_sentiment")], 
             col.names = c("Element ID", "Word Count", "SD", "Average Sentiment"),
             caption = "Sentiment Statistics for Sample Sentences")
Sentiment Statistics for Sample Sentences

Element ID   Word Count   SD   Average Sentiment
----------   ----------   --   -----------------
      1262           17   NA          -0.2425356
      2518           45   NA           0.0074536
      4323           26   NA          -0.0196116
      4858            3   NA           0.0000000
      9402           10   NA          -0.0632456
     14121            3   NA          -0.2886751
     14462            4   NA           0.5000000
     18024           20   NA           0.2012461
     18371           42   NA           0.3086067
     20062            2   NA           0.0000000
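
The warning shown earlier appears because unlist() drops the class that get_sentences() attaches, so sentiment_by() has to redo sentence boundary disambiguation. A minimal sketch of the alternative the warning suggests, passing the get_sentences() result directly (variable names here are illustrative):

# Split sentences once and reuse the result; sentiment_by() then skips
# the costly boundary-disambiguation step.
geb_sentences <- get_sentences(godel_escher_bach_df$text)
sentences_by_line <- sentiment_by(geb_sentences)

With this form, element_id refers to the original text lines (averaged over their sentences); per-sentence scores are available from sentiment(geb_sentences) instead.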

Conclusion

This assignment extended the analysis by applying sentiment analysis to a different corpus, Gödel, Escher, Bach, and by incorporating an additional sentiment lexicon beyond those discussed in Chapter 2 of the source text.

References

Hofstadter, D. R. (1979). Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books. Electronic version retrieved from https://welib.org/slow_download/06372a47c23884e660e534a0cc872f9b/0/0 (accessed November 2, 2025).

Silge, J., & Robinson, D. (2025). Text Mining with R: A Tidy Approach. https://www.tidytextmining.com/