Assignment_8

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate) 
library(tidytext)

Warning: package 'tidytext' was built under R version 4.4.3

library(textdata)

Warning: package 'textdata' was built under R version 4.4.3

library(widyr)

Warning: package 'widyr' was built under R version 4.4.3

library(igraph)

Warning: package 'igraph' was built under R version 4.4.3


Attaching package: 'igraph'

The following objects are masked from 'package:lubridate':

    %--%, union

The following objects are masked from 'package:dplyr':

    as_data_frame, groups, union

The following objects are masked from 'package:purrr':

    compose, simplify

The following object is masked from 'package:tidyr':

    crossing

The following object is masked from 'package:tibble':

    as_data_frame

The following objects are masked from 'package:stats':

    decompose, spectrum

The following object is masked from 'package:base':

    union

library(ggraph)

Warning: package 'ggraph' was built under R version 4.4.3

total_reviews_goodreads <- read_csv("total_reviews_goodreads.csv")

Rows: 60 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): book_id, reviewer_name, review_content, review_date
dbl (4): reviewer_id, reviewer_followers, reviewer_total_reviews, review_rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

How do Harry Potter and Percy Jackson reviews compare?

I want to see which gateway book receives more positive and negative reviews. These books are some of the most important stepping stones into fantasy for younger readers, and they were some of my favorites as a child. As a slightly more grown child, I have my own opinions as to which has stood the test of time. However, we are here to see the people's consensus. I gathered data from Goodreads using the Goodreader package. We have 30 reviews per book (60 in total), which happens to be the current maximum for the package due to stricter limits by Goodreads on the amount of data that can be pulled at a given time.

total_reviews_goodreads_as_words <- 
  total_reviews_goodreads %>% 
  unnest_tokens(word, review_content) %>% 
  anti_join(stop_words)

Joining with `by = join_by(word)`

num_words_by_book <- total_reviews_goodreads_as_words %>% 
  group_by(book_id, word) %>% 
  summarize(n = n()) %>% 
  filter(n > 5) %>% 
  arrange(-n)

`summarise()` has grouped output by 'book_id'. You can override using the
`.groups` argument.

head(num_words_by_book, n = 20)

# A tibble: 20 × 3
# Groups:   book_id [2]
   book_id       word         n
   <chr>         <chr>    <int>
 1 Harry Potter  harry      135
 2 Harry Potter  book       103
 3 Harry Potter  br          94
 4 Harry Potter  potter      81
 5 Harry Potter  read        78
 6 Percy Jackson book        65
 7 Harry Potter  books       62
 8 Percy Jackson percy       57
 9 Percy Jackson read        45
10 Harry Potter  ron         28
11 Harry Potter  series      28
12 Harry Potter  time        28
13 Percy Jackson greek       28
14 Percy Jackson reading     28
15 Harry Potter  reading     27
16 Harry Potter  world       27
17 Harry Potter  hermione    26
18 Harry Potter  2           25
19 Harry Potter  school      25
20 Percy Jackson series      25
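
The `br` token in row 3 is most likely residue from HTML line-break tags in the scraped review text (an assumption, since the raw `review_content` is not shown here). A minimal sketch of stripping tags before tokenizing follows. The `toy_reviews` data is made up for illustration, and `count()` is used as a shorthand for `group_by()` plus `summarize(n = n())`.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy reviews, invented for illustration; they contain HTML line breaks
toy_reviews <- tibble(
  book_id = c("Harry Potter", "Percy Jackson"),
  review_content = c(
    "Loved it!<br /><br />Harry is great.",
    "Fun read.<br>Greek myths everywhere."
  )
)

toy_words <- toy_reviews %>%
  # Replace anything that looks like an HTML tag with a space
  mutate(review_content = str_replace_all(review_content, "<[^>]+>", " ")) %>%
  unnest_tokens(word, review_content) %>%
  # Naming the join column avoids the "Joining with" message
  anti_join(stop_words, by = "word")

# count() tallies rows per book and word, sorted from most to least frequent
toy_counts <- toy_words %>%
  count(book_id, word, sort = TRUE)
```

With the tags replaced first, `br` never becomes a token, so it does not need to be removed by hand later.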

important_num_words_by_book <- num_words_by_book %>% filter(n > 20)

ggplot(important_num_words_by_book, aes(x = reorder_within(word, n, book_id), y = n, fill = book_id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book_id, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_reordered()

As this example shows, a bunch of words in the reviews have no bearing on the actual sentiment value. So I will get rid of all the words that are of no value and be back with a little movie magic.

num_words_by_book_minus_words <- num_words_by_book[-c(1,2,3,4, 5,6,7,8,9,11,12,15,17,19,25,21,26,29,32,38,42,43,47,48,50,51,68,73,74,75,76,77,79), ]
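
Removing rows by position depends on the exact row order, so the indices would point at different words if the data or the sort order changed. One possible alternative, sketched here under assumptions, is to remove words by name with `anti_join()`. The `custom_stop_words` list is hypothetical and is not the exact set of rows dropped above. The toy counts reuse a few values from the `head()` output.

```r
library(dplyr)

# A few word counts copied from the head() output, for illustration
toy_counts <- tibble(
  book_id = c("Harry Potter", "Harry Potter", "Harry Potter",
              "Percy Jackson", "Percy Jackson"),
  word    = c("harry", "br", "ron", "percy", "greek"),
  n       = c(135, 94, 28, 57, 28)
)

# Hypothetical custom stop list: title words, generic reading words, and "br"
custom_stop_words <- tibble(
  word = c("harry", "potter", "percy", "book", "books",
           "br", "read", "reading", "series")
)

# Keep only the rows whose word is NOT in the custom stop list
toy_filtered <- toy_counts %>%
  anti_join(custom_stop_words, by = "word")
```

Here `toy_filtered` keeps only the `ron` and `greek` rows, and the list of removed words stays readable in the code itself.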


important_num_words_by_book <- num_words_by_book_minus_words %>% filter(n > 10)

ggplot(important_num_words_by_book, aes(x = reorder_within(word, n, book_id), y = n, fill = book_id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book_id, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_reordered()

There, doesn't that look a lot better? We have Harry Potter on the left and Percy Jackson on the right (facet_wrap() orders the panels alphabetically by book_id). Now we can leap into a little bit of actual sentiment analysis.

bing <- get_sentiments("bing")

words_plus_polarity <- num_words_by_book_minus_words %>% 
  inner_join(bing, by = "word") 

words_plus_polarity %>%
  group_by(book_id,sentiment) %>% 
  summarize(n = n()) %>% 
  arrange(-n)

`summarise()` has grouped output by 'book_id'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
# Groups:   book_id [2]
  book_id       sentiment     n
  <chr>         <chr>     <int>
1 Harry Potter  positive      8
2 Percy Jackson positive      7
3 Harry Potter  negative      3
4 Percy Jackson negative      3

words_plus_polarity %>% 
  group_by(book_id) %>% 
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~book_id, ncol = 2)

Looking at the sentiment comparison, we can see that Harry Potter has a higher aggregate positive sentiment than Percy Jackson, at least in the sample that I obtained. The end result was a positive 50:39, Harry to Percy, after switching "funny" from negative to positive sentiment.
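
The code for the "funny" switch and the aggregate score is not shown above, so here is a rough sketch of how both steps could be done. All names and numbers in it (`toy_lexicon`, `toy_counts`, "Book A", "Book B") are made up for illustration and are not the real review data, so it does not reproduce the 50:39 result.

```r
library(dplyr)

# Toy lexicon in the same shape as get_sentiments("bing"); values invented
toy_lexicon <- tibble(
  word      = c("love", "funny", "boring"),
  sentiment = c("positive", "negative", "negative")
)

# Toy word counts per book; values invented
toy_counts <- tibble(
  book_id = c("Book A", "Book A", "Book A", "Book B", "Book B"),
  word    = c("love", "funny", "boring", "love", "funny"),
  n       = c(10, 6, 4, 8, 7)
)

# Recode one word in the lexicon before joining
adjusted_lexicon <- toy_lexicon %>%
  mutate(sentiment = if_else(word == "funny", "positive", sentiment))

net_sentiment <- toy_counts %>%
  inner_join(adjusted_lexicon, by = "word") %>%
  # Negative words count against the total, positive words count toward it
  mutate(signed_n = if_else(sentiment == "negative", -n, n)) %>%
  group_by(book_id) %>%
  summarize(net = sum(signed_n), .groups = "drop")
```

Summing the signed counts weights each word by how often it appears, whereas `summarize(n = n())` in the earlier table only counts how many distinct words fall under each sentiment.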

Conclusion

As a gateway into reading fantasy, Harry Potter is more highly regarded among Goodreads reviewers. I would personally dispute that, but I have a younger sister in the middle of her Harry Potter phase, so there is definitely some bias. Both of these books perform exceptionally well in terms of strong positive sentiment, especially given the small sample sizes. If we had compared some more controversial or less beginner-friendly novels, I think the race could have been more interesting, especially if I were to look at some classics where the possibility of negative sentiment stands in contrast to historic performance.