Assignment 8

Question:

How do readers emotionally respond to Pride and Prejudice compared to The Let Them Theory based on Goodreads reviews?

This analysis explores whether reader reviews of a classic book and a contemporary book differ in emotional tone.

Data Collection

I scraped 5 pages of Goodreads reviews for each book using rvest in R. The reviews were saved in a CSV file with two columns: review and book.
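
The scraping script itself is not shown here. As a rough sketch of how the two-column layout could be built with rvest, the example below parses small inline HTML snippets instead of live Goodreads pages. The `.review-text` selector, the `extract_reviews()` helper, and the sample review text are all hypothetical; the real page markup may differ.

```r
library(rvest)
library(dplyr)
library(tibble)

# Hypothetical stand-ins for two downloaded review pages (not real Goodreads markup)
page1 <- minimal_html('
  <div class="review-text">Witty and charming.</div>
  <div class="review-text">A slow start, but I loved the ending.</div>
')
page2 <- minimal_html('
  <div class="review-text">Darcy grows on you.</div>
')

# Pull the review text out of one parsed page and label it with the book title.
# The ".review-text" selector is an assumption made for this sketch.
extract_reviews <- function(page, book_title, selector = ".review-text") {
  tibble(
    review = html_text2(html_elements(page, selector)),
    book   = book_title
  )
}

# Stack the pages into one data frame with the columns review and book
scraped <- bind_rows(
  lapply(list(page1, page2), extract_reviews, book_title = "Pride and Prejudice")
)
```

Repeating this for the second book and writing the combined rows out (for example with `readr::write_csv()`) would give a file in the same shape as the one loaded below.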

library(tidyverse) # All the tidy things
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate) # Easily fixing pesky dates
library(tidytext)  # Tidy text mining
library(textdata)  # Lexicons of sentiment data
library(widyr)     # Easily calculating pairwise counts
library(igraph)    # Special graphs for network analysis
library(ggraph)    # An extension of ggplot for relational data
reviews <- 
  read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/krahs_xavier_edu/IQDxc2BcYZDqQYIaDQNgAkO-ARVgVq3gD9rC5_1V5NBO2cM?download=1")
Rows: 335 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): review, book

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Wrangling

I removed blank reviews, split each review into single words, dropped stop words, and joined the remaining words to the NRC emotion lexicon.
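
The join with NRC is many-to-many by design: the lexicon has one row per word-emotion pair, and the same word can appear in many reviews. A minimal sketch with a made-up three-row lexicon (not real NRC entries) shows how the row count grows, and how dplyr's `relationship` argument (dplyr 1.1.0 or later) declares that this is expected:

```r
library(dplyr)
library(tibble)

# Toy lexicon in the NRC layout: one row per word-emotion pair (made-up entries)
toy_lexicon <- tribble(
  ~word,  ~sentiment,
  "love", "joy",
  "love", "positive",
  "lost", "sadness"
)

# Toy tokens: "love" occurs twice, "lost" once
toy_tokens <- tibble(
  book = c("A", "A", "B"),
  word = c("love", "love", "lost")
)

# Each "love" token matches two lexicon rows, so 2 x 2 + 1 = 5 rows come back
joined <- inner_join(toy_tokens, toy_lexicon, by = "word",
                     relationship = "many-to-many")
nrow(joined)
```

One consequence is that a single word can count toward several emotions at once, so the emotion totals add up to more than the number of matched words.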

clean_reviews <- 
  reviews %>% 
  filter(review != "")

nrc <- 
  get_sentiments("nrc")

tidy_reviews <- 
  clean_reviews %>% 
  unnest_tokens(word, review) %>%          # one row per word
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  # NRC can tag one word with several emotions, so many-to-many is expected
  inner_join(nrc, by = "word", relationship = "many-to-many")

Which Emotions Dominate the Reviews for Each Book?

emotion_counts <- 
  tidy_reviews %>% 
  count(book, sentiment) %>% 
  group_by(book) %>% 
  mutate(percent = n / sum(n))

ggplot(emotion_counts, aes(x = sentiment, y = percent, fill = book)) + 
  geom_col(position = "dodge") +
  labs(title = "Emotional Tone in Goodreads Reviews",
       y = "Proportion of Emotion Words",
       x = "Emotion")

Pride and Prejudice reviews show higher proportions of joy, surprise, and positive words, which fits the novel's romantic and witty tone.

The Let Them Theory reviews show more words tagged sadness, fear, or anger, which may reflect how differently readers interpret its themes.
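
Note that NRC mixes eight emotions with two polarity labels (positive and negative), so the shares above are computed over all ten categories. If the comparison should cover the emotions only, the polarity rows can be dropped before normalizing. A minimal sketch with made-up counts (the books "A" and "B" and all numbers are illustrative, not results from this data):

```r
library(dplyr)
library(tibble)

# Made-up counts in the same shape as emotion_counts (book, sentiment, n)
toy_counts <- tribble(
  ~book, ~sentiment, ~n,
  "A",   "joy",      30,
  "A",   "sadness",  10,
  "A",   "positive", 60,
  "B",   "joy",       5,
  "B",   "sadness",  15,
  "B",   "positive", 20
)

# Drop the polarity labels, then compute each emotion's share within its book
emotion_only <- toy_counts %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  group_by(book) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()
```

In this toy data, joy makes up 30 of 40 emotion words for book A and 5 of 20 for book B, so the shares within each book sum to 1.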

Which Words Appear Most in Reviews for Each Book? (Top 10)

top_words <- 
  clean_reviews %>% 
  unnest_tokens(word, review) %>% 
  anti_join(stop_words, by = "word") %>% 
  count(book, word, sort = TRUE) %>% 
  group_by(book) %>% 
  slice_max(n, n = 10) %>%   
  ungroup()
  
ggplot(top_words, aes(x = reorder_within(word, n, book), y = n, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, scales = "free") +
  scale_x_reordered() +
  labs(title = "Top 10 Most Common Words in Reviews",
       x = "Word",
       y = "Frequency")

This surfaces recurring themes and language in each book’s reviews. Pride and Prejudice reviews highlight “love”, while The Let Them Theory reviews highlight “people” and “life”. Most of the other top words refer to the books’ characters or authors.
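
Because names tend to crowd out thematic words, one option is to add a small custom stop list before counting. The sketch below uses toy tokens; the list of names is a hypothetical starting point, not the set of words actually found in these reviews.

```r
library(dplyr)
library(tibble)

# Hypothetical custom stop list of character and author names
custom_stop <- tibble(
  word = c("elizabeth", "darcy", "austen", "mel", "robbins")
)

# Toy tokens standing in for the unnested review words
toy_words <- tibble(
  book = c("Pride and Prejudice", "Pride and Prejudice", "Pride and Prejudice",
           "The Let Them Theory", "The Let Them Theory"),
  word = c("darcy", "love", "love", "robbins", "people")
)

# Remove the names, then count what is left per book
themed_words <- toy_words %>%
  anti_join(custom_stop, by = "word") %>%
  count(book, word, sort = TRUE)
```

In the real pipeline the same `anti_join()` would sit right after the one that removes `stop_words`.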

How Do Readers Emotionally Respond to Each Book?

bing <- get_sentiments("bing")

polarity <- clean_reviews %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word") %>%
  inner_join(bing, by = "word") %>%
  count(book, sentiment)

ggplot(polarity, aes(x = sentiment, y = n, fill = book)) +
  geom_col(position = "dodge") +
  labs(title = "Positive vs. Negative Sentiment in Reviews",
       x = "Sentiment",
       y = "Word Count") 

This chart shows the total number of positive and negative words found in the Goodreads reviews for each book, based on the Bing sentiment lexicon. Pride and Prejudice has a higher total word count, and positive sentiment dominates. The Let Them Theory has a lower total word count, and negative sentiment slightly outweighs positive, which suggests more mixed reactions. Because these are raw counts, the totals also depend on how much review text each book has.
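
To compare the two books independently of how much text each one has, the counts can be turned into a positive share per book. A minimal sketch with made-up numbers (books "A" and "B" and their counts are illustrative only):

```r
library(dplyr)
library(tibble)

# Made-up counts in the same shape as polarity (book, sentiment, n)
toy_polarity <- tribble(
  ~book, ~sentiment, ~n,
  "A",   "positive", 300,
  "A",   "negative", 100,
  "B",   "positive",  45,
  "B",   "negative",  55
)

# Share of sentiment words that are positive, computed within each book
positive_share <- toy_polarity %>%
  group_by(book) %>%
  mutate(share = n / sum(n)) %>%
  ungroup() %>%
  filter(sentiment == "positive")
```

Here book A is 75% positive and book B is 45% positive, even though A has four times as many sentiment words in total.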

Conclusion

This analysis shows how reader language reflects the emotional tone of each book. Pride and Prejudice stands out with a higher count of sentiment words and mostly positive sentiment, which is consistent with its romantic narrative. The Let Them Theory draws a somewhat more complex emotional response, with slightly more negative than positive sentiment. Its reviews focus on themes like “people” and “life”. They also contain more words in emotion categories such as sadness and fear, which may show that readers engage with the text in different ways.