── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
library(lubridate)
library(tidytext)
library(textdata)
library(widyr)
library(igraph)
library(ggraph)
Rows: 60 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): book_id, reviewer_name, review_content, review_date
dbl (4): reviewer_id, reviewer_followers, reviewer_total_reviews, review_rating
How do Harry Potter and Percy Jackson reviews compare?
I want to see which gateway book receives more positive and more negative reviews. These books are some of the most important stepping stones into fantasy for younger readers, and they were some of my favorites as a child. As a slightly more grown child, I have my own opinions as to which has stood the test of time. However, we are here to see the people's consensus. I gathered data from Goodreads using the Goodreader package. We have 30 reviews per book (60 rows in total), which happens to be the current maximum for the package because Goodreads has tightened its limits on the amount of data that can be pulled at a given time.
head(num_words_by_book, n = 20)
# A tibble: 20 × 3
# Groups: book_id [2]
book_id word n
<chr> <chr> <int>
1 Harry Potter harry 135
2 Harry Potter book 103
3 Harry Potter br 94
4 Harry Potter potter 81
5 Harry Potter read 78
6 Percy Jackson book 65
7 Harry Potter books 62
8 Percy Jackson percy 57
9 Percy Jackson read 45
10 Harry Potter ron 28
11 Harry Potter series 28
12 Harry Potter time 28
13 Percy Jackson greek 28
14 Percy Jackson reading 28
15 Harry Potter reading 27
16 Harry Potter world 27
17 Harry Potter hermione 26
18 Harry Potter 2 25
19 Harry Potter school 25
20 Percy Jackson series 25
important_num_words_by_book <- num_words_by_book %>%
  filter(n > 20)

ggplot(important_num_words_by_book,
       aes(x = reorder_within(word, n, book_id), y = n, fill = book_id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book_id, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_reordered()
As we can see from this example, many of the most frequent words in the reviews (character names, "book", "read", and the stray HTML tag "br") have no bearing on the actual sentiment. So I will get rid of all of the words that carry no value and be back with you after a little movie magic.
num_words_by_book_minus_words <- num_words_by_book[
  -c(1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 15, 17, 19, 25, 21, 26, 29, 32, 38,
     42, 43, 47, 48, 50, 51, 68, 73, 74, 75, 76, 77, 79),
]

important_num_words_by_book <- num_words_by_book_minus_words %>%
  filter(n > 10)

ggplot(important_num_words_by_book,
       aes(x = reorder_within(word, n, book_id), y = n, fill = book_id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book_id, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_reordered()
There, doesn't that look a lot better? We have Percy Jackson on the left and Harry Potter on the right. Now we can leap into a little bit of actual sentiment analysis.
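One caveat about the cleanup step: the uninformative words were dropped by row position, which silently removes the wrong rows if the table is ever re-sorted or the reviews are pulled again. A minimal sketch of a name-based alternative follows. It assumes a hand-written stop-word list (`custom_stop_words` is a hypothetical name, and its contents are only a guess at the kind of words being removed), and it uses a small stand-in table with a few rows copied from the output above rather than the full data.

```r
library(dplyr)

# Stand-in for num_words_by_book: a few rows copied from the table above
num_words_by_book <- tibble::tribble(
  ~book_id,        ~word,   ~n,
  "Harry Potter",  "harry", 135,
  "Harry Potter",  "book",  103,
  "Harry Potter",  "br",     94,
  "Percy Jackson", "book",   65,
  "Percy Jackson", "percy",  57,
  "Harry Potter",  "ron",    28,
  "Percy Jackson", "greek",  28
)

# Hypothetical list of words to drop (an assumption, not the original list)
custom_stop_words <- tibble::tibble(
  word = c("harry", "potter", "percy", "book", "books", "br", "read")
)

# Keep only the rows whose word is NOT in the stop-word list
num_words_by_book_minus_words <- num_words_by_book %>%
  anti_join(custom_stop_words, by = "word")

print(num_words_by_book_minus_words)
```

Because the filter matches on the word itself, it keeps working no matter how the rows are ordered, and the list doubles as documentation of what was removed.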
# A tibble: 4 × 3
# Groups: book_id [2]
book_id sentiment n
<chr> <chr> <int>
1 Harry Potter positive 8
2 Percy Jackson positive 7
3 Harry Potter negative 3
4 Percy Jackson negative 3
Looking at the sentiment comparison, we can see that Harry Potter has a higher aggregate positive sentiment than Percy Jackson, at least in the sample that I obtained. The end result was 50:39 positive, Harry to Percy, after switching "funny" from negative to positive sentiment.
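For readers curious how a single word such as "funny" can be reclassified before counting, here is a minimal sketch. Everything in it is a toy: `toy_lexicon` and `toy_counts` are hypothetical stand-ins with invented values, not the lexicon or the review data used above, and the pipeline is one plausible way to do it rather than the exact code behind the table.

```r
library(dplyr)

# Toy sentiment lexicon (hypothetical; a real analysis would load one
# from a package instead)
toy_lexicon <- tibble::tibble(
  word      = c("love", "magic", "funny", "boring"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Override a single entry: treat "funny" as positive
adjusted_lexicon <- toy_lexicon %>%
  mutate(sentiment = if_else(word == "funny", "positive", sentiment))

# Toy per-book word counts (invented numbers, for illustration only)
toy_counts <- tibble::tribble(
  ~book_id,        ~word,    ~n,
  "Harry Potter",  "love",    3,
  "Harry Potter",  "funny",   2,
  "Harry Potter",  "boring",  1,
  "Percy Jackson", "funny",   4,
  "Percy Jackson", "boring",  2
)

# Attach sentiment labels, then sum the word counts per book and sentiment
sentiment_totals <- toy_counts %>%
  inner_join(adjusted_lexicon, by = "word") %>%
  count(book_id, sentiment, wt = n, name = "total")

print(sentiment_totals)
```

Adjusting the lexicon once, before the join, keeps the override in a single visible place instead of patching the results afterwards.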
Conclusion
As a gateway into reading fantasy, Harry Potter is more highly regarded among Goodreads reviewers. I would personally dispute that, but I have a younger sister in the middle of her Harry Potter phase, so there is definitely some bias. Both of these books perform exceptionally well in terms of strong positive sentiment, especially given the small sample sizes. If we had compared some more controversial or less beginner-friendly novels, I think the race could have been more interesting, especially with some classics, where negative reviewer sentiment might stand in contrast to their historic reputation.