Assignment 7 QMD

Quarto

For this assignment I needed a site that was straightforward to scrape, and books.toscrape.com worked well since it's a practice site designed to be scraped freely. Looking at the data, I noticed books are organized across 50 genre categories with prices and star ratings attached, which made me curious whether genre has anything to do with how a book is priced or rated. That's what I'm looking at here: do certain genres tend to cost more, and do ratings vary across genres in any consistent way? It's a simple question but useful in practice, for example for a bookstore thinking about pricing or a recommendation system that factors in genre.

The data comes from books.toscrape.com, which lists 1,000 books across 50 genres. Each book has a title, a price in GBP, a star rating from one to five, and an availability status. The site is static HTML, so basic rvest scraping works without any extra tools. The script pulls the category list from the homepage, then loops through each category and follows pagination links until there are no more pages. That loop is really the point of scraping over copying by hand, since it handles all 50 categories automatically. I added a short delay between requests and a user-agent string to keep the load on the site low and to identify the scraper. The output is a CSV with one row per book, which I uploaded to OneDrive and import into this document from there.
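The scraping loop described above could look roughly like the sketch below. This is not the actual script used for the assignment: the CSS selectors are assumptions about the site's markup, and the user-agent text, the one-second delay, and all function names are hypothetical. The output columns follow the names in the imported CSV.

```r
library(rvest)

base_url   <- "https://books.toscrape.com/"
user_agent <- "course-assignment-scraper"  # hypothetical identifying string

# Convert a class attribute such as "star-rating Three" to the number 3.
rating_to_number <- function(class_attr) {
  words <- c(One = 1, Two = 2, Three = 3, Four = 4, Five = 5)
  unname(words[sub("^star-rating\\s+", "", class_attr)])
}

# Turn one parsed listing page into a data frame with one row per book.
# Selectors are assumptions about the page structure.
parse_books <- function(page, category, page_number) {
  pods <- html_elements(page, "article.product_pod")
  price_text <- html_text(html_element(pods, "p.price_color"))
  data.frame(
    title             = html_attr(html_element(pods, "h3 a"), "title"),
    category          = rep(category, length(pods)),
    price_gbp         = as.numeric(gsub("[^0-9.]", "", price_text)),
    rating            = rating_to_number(
      html_attr(html_element(pods, "p.star-rating"), "class")
    ),
    availability      = html_text2(html_element(pods, "p.availability")),
    page_scraped_from = rep(page_number, length(pods)),
    stringsAsFactors  = FALSE
  )
}

# Download one page politely: pause first, then send the user-agent string.
fetch_page <- function(url) {
  Sys.sleep(1)  # assumed delay of one second between requests
  resp <- httr::GET(url, httr::user_agent(user_agent))
  httr::stop_for_status(resp)
  read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Follow the "next" link within a category until there is none left.
scrape_category <- function(url, category) {
  results <- list()
  page_number <- 1
  while (!is.na(url)) {
    page <- fetch_page(url)
    results[[page_number]] <- parse_books(page, category, page_number)
    next_href <- html_attr(html_element(page, "li.next a"), "href")
    url <- if (is.na(next_href)) NA else xml2::url_absolute(next_href, url)
    page_number <- page_number + 1
  }
  do.call(rbind, results)
}

# Read the category list from the homepage and scrape every category.
scrape_all <- function() {
  home  <- fetch_page(base_url)
  links <- html_elements(home, "div.side_categories ul ul a")
  urls  <- xml2::url_absolute(html_attr(links, "href"), base_url)
  books <- do.call(rbind, Map(scrape_category, urls, html_text2(links)))
  rownames(books) <- NULL
  books
}

# Not run here, since it makes live requests:
# books <- scrape_all()
# write.csv(books, "books.csv", row.names = FALSE)
```

Keeping the parsing in `parse_books()` separate from the downloading in `fetch_page()` means the parsing step can be checked on a saved or hand-written HTML snippet without touching the site.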

library(ggplot2)
library(dplyr)

library(tidyverse)
books_data <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/wasserstromo_xavier_edu/IQBBPm8_GI4_Qahac5qpchoOATMOss2e9gprRaUjOaAPn18?download=1")
Rows: 1000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): title, category, availability
dbl (3): price_gbp, rating, page_scraped_from

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Bar chart: average price by genre
books_data %>% 
  group_by(category) %>% 
  summarise(mean_price = mean(price_gbp)) %>% 
  ggplot(aes(x = reorder(category, mean_price), y = mean_price)) +
  geom_col(fill = "#2a9d8f") +
  coord_flip() +
  labs(title = "Average Price by Genre", x = NULL, y = "Mean Price (GBP)") +
  theme_minimal()

# Boxplot: star rating distribution by genre
books_data %>% 
  ggplot(aes(x = reorder(category, rating, FUN = median), y = rating)) +
  geom_boxplot(fill = "#457b9d", alpha = 0.7) +
  coord_flip() +
  labs(title = "Star Rating Distribution by Genre", x = NULL, y = "Rating (1-5)") +
  theme_minimal()

Analysis & Results

The bar chart shows the average price per genre, and there is a noticeable spread across categories. Art and Food & Drink come in as the most expensive genres, averaging well above the others, while Children's and Poetry sit at the lower end. That pattern makes intuitive sense, since specialty or illustrated books tend to be pricier regardless of content. This supports the idea that genre is at least somewhat tied to pricing, even if it's not the only factor.

The boxplot tells a different story when it comes to ratings. Most genres have a median rating of around 3 to 4 stars, and the boxes are similar in size across the board, meaning the spread within each genre isn't dramatically different either. Fantasy and Science Fiction nudge slightly higher, while Self Help shows a bit more variation, with some lower ratings pulling it down, but overall no genre stands out as consistently better or worse rated than the rest.

Putting both visuals together, the data suggests that genre has a real relationship with price but not much of one with rating. You can't really use price as a signal for quality here, since the more expensive genres aren't getting better ratings. For something like a pricing model or recommendation system, genre would be a useful input for estimating cost but a pretty weak one for predicting how well-received a book will be.
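One way to put numbers behind the two charts is a per-genre summary table. The sketch below is an illustration rather than part of the original analysis: `summarise_by_genre()` is a hypothetical helper, and the toy data frame contains made-up values that only reuse the column names from the CSV.

```r
library(dplyr)

# Per-genre book count, mean price, and median rating, most expensive first.
summarise_by_genre <- function(books) {
  books %>%
    group_by(category) %>%
    summarise(
      n_books       = n(),
      mean_price    = mean(price_gbp),
      median_rating = median(rating),
      .groups       = "drop"
    ) %>%
    arrange(desc(mean_price))
}

# Toy data with made-up values, used only to show the shape of the result.
toy_books <- data.frame(
  category  = c("Art", "Art", "Poetry", "Poetry", "Poetry"),
  price_gbp = c(50, 40, 20, 30, 10),
  rating    = c(3, 5, 2, 4, 3)
)

summarise_by_genre(toy_books)
```

On the real data this would be called as `summarise_by_genre(books_data)`. The `n_books` column is worth a look alongside the averages, because a mean based on only a handful of books can shift a lot with one unusually priced title.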