Assignment 7

Introduction

Goodreads is a go-to platform for book lovers to track what they’ve read, share reviews, and discover new favorites. One of its most popular features is the “Best Books Ever” list, which ranks books based on user ratings and votes. It’s a mix of classics, modern hits, and everything in between, giving a good snapshot of what readers around the world consider the best of the best.

Analysis

In this analysis, I looked at the books on the “Best Books Ever” list to explore how readers rate and engage with them. Using R to scrape and visualize the data, I compared average ratings and the number of ratings each book received, and identified which books were the most popular. The goal was to understand what makes a book highly rated or widely read, and whether the two always go hand in hand.

Data Source:
The data come from Goodreads’ “Best Books Ever” list, which is publicly accessible and served as structured HTML.

https://www.goodreads.com/list/show/1.Best_Books_Ever

Variables to Collect:

Book Title
Author
Average Rating
Number of Ratings
Book URL

Tools and Libraries:

rvest for web scraping
httr for sending requests with a user agent
dplyr for data manipulation
stringr for string operations
readr for reading and writing data
ggplot2 for visualization

Scraping Method:
Using R’s rvest and httr packages:

A custom scraping function loops through multiple pages of results.
Each request includes a user_agent string to identify me, and Sys.sleep(2) pauses between requests to avoid overloading the server.
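
The scraping steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code behind this report: the CSS selectors (tr[itemtype=...], a.bookTitle, a.authorName, span.minirating), the ?page= query parameter, the function names, and the user agent string are all assumptions about the page markup and would need to be checked against the live site.

```r
library(rvest)
library(httr)
library(stringr)
library(dplyr)

# Parse one list page into a tibble with the five variables.
# All CSS selectors here are assumptions about Goodreads' markup.
parse_list_page <- function(page) {
  rows  <- html_elements(page, "tr[itemtype='http://schema.org/Book']")
  links <- html_element(rows, "a.bookTitle")
  # Assumed text shape: "4.34 avg rating ... 1,234,567 ratings"
  mini  <- html_text2(html_element(rows, "span.minirating"))
  tibble(
    title       = html_text2(links),
    author      = html_text2(html_element(rows, "a.authorName")),
    avg_rating  = as.numeric(str_extract(mini, "\\d\\.\\d+(?= avg rating)")),
    num_ratings = as.numeric(str_remove_all(
      str_extract(mini, "[\\d,]+(?= ratings)"), ","
    )),
    url         = paste0("https://www.goodreads.com", html_attr(links, "href"))
  )
}

# Request each page with an identifying user agent, pause after each
# request, and stack the parsed pages into one data frame.
scrape_list <- function(base_url, pages = 1:3, pause = 2) {
  ua <- user_agent("course-assignment-scraper (placeholder contact)")  # hypothetical string
  results <- lapply(pages, function(p) {
    resp <- GET(base_url, query = list(page = p), ua)  # ?page= is an assumption
    stop_for_status(resp)
    Sys.sleep(pause)
    parse_list_page(read_html(content(resp, as = "text", encoding = "UTF-8")))
  })
  bind_rows(results)
}
```

Calling scrape_list("https://www.goodreads.com/list/show/1.Best_Books_Ever") would then return one row per book, which could be saved with readr::write_csv() so the site only has to be scraped once.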

Why is this data suitable?

Publicly accessible
Provides multiple relevant variables
Allows meaningful grouping by popularity (number of ratings)

library(rvest)
library(dplyr)
library(stringr)
library(readr)
library(ggplot2)
knitr::opts_chunk$set(echo = FALSE)

Visualization 1. Average Rating Distribution: Most vs. Least Rated Books

Explanation: This density plot compares the distribution of average ratings for books with more than 2 million ratings against books with 2 million or fewer. It helps show whether widely rated books also tend to be highly rated, and how tightly average ratings cluster at each popularity level.
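
A plot of this kind could be built along the following lines. This is a sketch under assumptions: it expects a data frame called books with columns avg_rating and num_ratings, and the function names and group labels are illustrative rather than taken from the original code.

```r
library(dplyr)
library(ggplot2)

# Label each book by whether it has more than 2 million ratings.
label_popularity <- function(books) {
  books %>%
    mutate(popularity = if_else(num_ratings > 2e6,
                                "More than 2M ratings",
                                "2M ratings or fewer"))
}

# Overlay one density curve of average rating per popularity group.
plot_rating_density <- function(books) {
  ggplot(label_popularity(books), aes(x = avg_rating, fill = popularity)) +
    geom_density(alpha = 0.5) +
    labs(x = "Average rating", y = "Density", fill = "Popularity")
}
```

The semi-transparent fill (alpha = 0.5) keeps both curves visible where they overlap.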

Visualization 2. Ratings Distribution by Popularity (10 Most-Rated vs. 10 Least-Rated Books)

Explanation: This boxplot contrasts the spread of average ratings for the 10 books with the most ratings and the 10 books with the fewest. It highlights how rating behavior may differ at the extremes of popularity.
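
One way to select the two extreme groups and draw the boxplot is sketched below. It assumes the same hypothetical books data frame (columns avg_rating and num_ratings); the helper names are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Keep the n books with the most ratings and the n with the fewest,
# tagging each row with the group it came from.
extremes_by_rating_count <- function(books, n = 10) {
  bind_rows(
    books %>%
      slice_max(num_ratings, n = n, with_ties = FALSE) %>%
      mutate(group = "Most rated"),
    books %>%
      slice_min(num_ratings, n = n, with_ties = FALSE) %>%
      mutate(group = "Least rated")
  )
}

# Side-by-side boxplots of average rating for the two groups.
plot_extremes_boxplot <- function(books, n = 10) {
  ggplot(extremes_by_rating_count(books, n), aes(x = group, y = avg_rating)) +
    geom_boxplot() +
    labs(x = NULL, y = "Average rating")
}
```

Setting with_ties = FALSE keeps each group at exactly n books even when several books share the same rating count.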

Visualization 3. Distribution of Rating Counts (Log Scale)

Explanation: This histogram shows the distribution of how many ratings books have received, using a log-scaled x-axis to accommodate the wide range. It helps reveal patterns among books with low, moderate, and extremely high numbers of ratings.
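
A log-scaled histogram like this could be produced roughly as follows. The sketch assumes a books data frame with a num_ratings column of positive counts; the function name and the bin count of 30 are illustrative choices.

```r
library(ggplot2)

# Histogram of rating counts on a log10 x-axis, so books with thousands
# of ratings and books with millions are both visible.
plot_rating_counts <- function(books, bins = 30) {
  ggplot(books, aes(x = num_ratings)) +
    geom_histogram(bins = bins) +
    scale_x_log10(labels = scales::label_comma()) +
    labs(x = "Number of ratings (log scale)", y = "Number of books")
}
```

Because the scale is applied before binning, the bins are equally wide in log units rather than in raw counts.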

Visualization 4. Distribution of Average Ratings

Explanation: This histogram provides an overview of how average ratings are distributed across all books. It helps identify common rating scores and shows whether most books are rated similarly or vary widely.

Visualization 5. Top 10 Books by Number of Ratings

Explanation: This bar chart highlights the 10 books with the most total ratings. Treating the number of ratings as a proxy for readership, it identifies the most widely read and engaged-with books on the list.
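
A ranking chart of this kind might be built as in the sketch below, again assuming a hypothetical books data frame with title and num_ratings columns; the function names are illustrative.

```r
library(dplyr)
library(ggplot2)

# Keep the n books with the highest rating counts, largest first.
top_books_by_ratings <- function(books, n = 10) {
  books %>% slice_max(num_ratings, n = n, with_ties = FALSE)
}

# Horizontal bars ordered by rating count so long titles stay readable.
plot_top_books <- function(books, n = 10) {
  ggplot(top_books_by_ratings(books, n),
         aes(x = reorder(title, num_ratings), y = num_ratings)) +
    geom_col() +
    coord_flip() +
    scale_y_continuous(labels = scales::label_comma()) +
    labs(x = NULL, y = "Number of ratings")
}
```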