Assignment 6 - Scraping Data (Goodreads)

Author

Katie Fuller

Assignment 6 - Scraping data ethically from a web source

What am I interested in and what question am I looking to answer?

The topic I am exploring today is the Goodreads data for a book series I read in high school. This series is still my favorite guilty-pleasure read, and I am never disappointed when I re-read it. I have referred to Goodreads in the past before reading books. I am curious whether I would still have read The Selection Series if I had looked at its Goodreads pages first and judged it by the reviews. I’m also very curious about how others perceive the relative rankings of the books within the series.

How these questions will be answered:

I will be utilizing the Goodreads pages for the first two books in The Selection Series: The Selection and The Elite. Goodreads is the world’s largest platform for reading and book recommendations. On Goodreads, each book has a dedicated page with details such as author information, ratings, comments, quotes, and recommendations.

I have scraped some of the reviews on each book’s page and have hosted the scraped dataset (in CSV format) on OneDrive. I will be utilizing this CSV file to get a better understanding of people’s opinions on these books.

(Disclaimer: The website only allowed me to scrape a single page of data for each book (Goodreads did not allow me to loop through more pages), so I only have 60 rows of data: 30 reviews from The Selection and 30 reviews from The Elite. Therefore, this is not a representative analysis of all of the Goodreads reviews on each book’s page. We talked during office hours, and Lindsey let me know it was OK to still use this data.)

Data Modifications:

  • I created a new column that converts the ‘review_rating’ column (the book rating between 1 and 5 stars) to the number it was given. For example, the new column ‘numeric_rating’ will contain ‘2’ when ‘review_rating’ contains ‘Rating 2 out of 5’.
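A minimal sketch of this conversion, assuming the rating text always follows the ‘Rating N out of 5’ pattern shown above. The helper name parse_rating and the sample rows are hypothetical and stand in for the scraped CSV; only the column names ‘review_rating’ and ‘numeric_rating’ come from the dataset description.

```python
import re

def parse_rating(text):
    """Return the star count from text like 'Rating 2 out of 5', or None if absent."""
    match = re.search(r"Rating (\d) out of 5", text or "")
    return int(match.group(1)) if match else None

# Hypothetical rows standing in for the scraped CSV (not real review data)
rows = [
    {"book": "The Selection", "review_rating": "Rating 2 out of 5"},
    {"book": "The Elite", "review_rating": "Rating 5 out of 5"},
    {"book": "The Elite", "review_rating": ""},  # a review left without a star rating
]

# Add the numeric column alongside the original text column
for row in rows:
    row["numeric_rating"] = parse_rating(row["review_rating"])
```

Returning None for unrated reviews (rather than 0) keeps them from dragging down any averages computed later.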

Analysis:

Rating:

First, I want to know if I would pick up these books after simply seeing their overall ratings. In order to do this, I want to see the average star rating (out of 5) of each book.

I have created a bar plot visualization that helps compare the average rating for The Selection and The Elite.

Interpretation: The visualization shows that the first book of the series (The Selection) has a slightly higher average star rating than the second book of the series (The Elite), with both around 2.75/5 stars.
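The per-book averages behind a bar plot like this one could be computed roughly as follows. This is only a sketch: the helper average_by and the sample values are hypothetical, not the scraped data or the code actually used for the plot.

```python
from collections import defaultdict
from statistics import mean

def average_by(rows, group_key, value_key):
    """Average value_key within each group_key, skipping rows with a missing value."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(value_key)
        if value is not None:
            groups[row[group_key]].append(value)
    return {group: mean(values) for group, values in groups.items()}

# Made-up ratings for illustration only
sample = [
    {"book": "The Selection", "numeric_rating": 2},
    {"book": "The Selection", "numeric_rating": 4},
    {"book": "The Elite", "numeric_rating": 1},
    {"book": "The Elite", "numeric_rating": 4},
    {"book": "The Elite", "numeric_rating": None},  # unrated review is ignored
]

average_rating = average_by(sample, "book", "numeric_rating")
```

The same helper would work for the average review likes per book by passing a different value column.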

Review Likes:

Next, I want to see which book’s reviews have more likes. My hypothesis is that if there are more likes, more people are liking positive reviews they agree with (because there is no dislike button on reviews). This may be an indicator that the book itself is more liked. To find this, I am going to create a bar plot that shows the average review likes for each book.

Interpretation: As you can see from the bar plot, The Selection’s reviews have an average of 750 likes, and The Elite’s reviews have an average of just under 400 likes. However, more likes on reviews isn’t necessarily an indicator that the reviews are more positive.

In order to see if reviews with more likes tend to be more positive, I am going to test the relationship between each review’s star rating and how many likes the review has received. To show this relationship, I am creating a scatter plot to see if there is any relation between the review likes and the review ratings.

Interpretation: This visualization shows that lower-rated reviews (primarily 1-star reviews) have more likes than higher-rated reviews. This means that having more likes on a book’s reviews does not necessarily mean they are positive reviews; in this sample, the most-liked reviews are more likely to be negative.
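One way to put a number on the pattern in a scatter plot like this is a Pearson correlation between rating and likes. This was not part of the original analysis; the sketch below is an assumption about how it could be checked, and the values are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values: one star rating and one like count per review
ratings = [1, 2, 3, 4, 5]
likes = [900, 700, 400, 300, 100]

r = pearson(ratings, likes)  # negative when lower-rated reviews collect more likes
```

A value near -1 would match the pattern described above, while a value near 0 would suggest no linear relationship. With only 60 reviews, a few very popular reviews could move this number a lot.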

Review Length:

Next, I want to see which book’s reviews are longer. My hypothesis is that people who have more to say about a book must really like it. This may be an indicator that the book is more liked. To find this, I am going to create a bar plot that shows the average character count in each book’s reviews:

Interpretation: As you can see from the bar plot, the first book of the series (The Selection, around 3,750 characters) has a noticeably higher average character count per review than the second book (The Elite, around 3,000 characters). However, a longer review isn’t necessarily an indicator that the review is more positive.

In order to see if longer reviews tend to be more positive, I have created a bar plot that shows the average number of characters in reviews by 5-star rating.

Interpretation: After looking at this bar plot, I can see that, for the most part, lower-rated reviews are longer based on character count. This means that a longer review is not necessarily a positive review; in this sample, longer reviews are more likely to be negative.
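The averages behind a plot like this could be computed roughly as below: take the character count of each review’s text, then average within each star rating. The column name ‘review_text’ is a guess at how the scraped CSV stores the review body, and the rows are placeholders rather than real reviews.

```python
from collections import defaultdict
from statistics import mean

# Placeholder reviews: 'review_text' is an assumed column name
reviews = [
    {"numeric_rating": 1, "review_text": "x" * 40},
    {"numeric_rating": 1, "review_text": "x" * 60},
    {"numeric_rating": 5, "review_text": "x" * 20},
]

# Collect the character count of every review under its star rating
lengths_by_rating = defaultdict(list)
for review in reviews:
    lengths_by_rating[review["numeric_rating"]].append(len(review["review_text"]))

# Average length per rating, ordered from 1 star up to 5 stars
avg_length_by_rating = {
    rating: mean(lengths) for rating, lengths in sorted(lengths_by_rating.items())
}
```

Because len counts characters (including spaces and punctuation), it matches the character-count measure used here rather than a word count.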

Overall Interpretation:

From this small sample of the Goodreads data, I am able to see that The Selection is slightly higher rated than The Elite. However, based on further interpretation of the reviews, The Selection has longer reviews with more likes, and since longer and more-liked reviews tended to be negative in this sample, this suggests it may be drawing more criticism in its reviews. Ultimately, more data from the other Goodreads reviews on each book’s page is needed in order to analyze this further and reduce the effect of possible outliers. After this analysis, I would re-read The Selection over The Elite because of the overall ratings, but I would need to look at more reviews to get a more accurate understanding of which book Goodreads reviewers think is better.