library(tidyverse)
library(ggplot2)
library(readr)
Ethical Web Scraping
using data about books
Question
For this assignment, I wanted to look at data on books. Reading is one of my favorite hobbies outside of class, so I wanted to gain some extra insight into the world of books. After some quick research, I found a website that had information on books and openly encouraged web scraping. Using this data, I hope to answer the question of whether there is a relationship between a book’s price, rating, and availability.
Data Collection
Data for this analysis was collected from the Books to Scrape website. I read on an R subreddit that this website is very beginner-friendly and actively welcomes people who are new to scraping, so I found it to be a perfect fit. I scraped 50 pages of the website with a loop that pulled each book’s title, price, rating, availability, and product URL. The scraping script was saved as a separate R script, and the dataset was saved on my academic OneDrive.
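The scraping script itself is not part of this report, so the following is only a minimal sketch of what such a loop could look like with the rvest package. The CSS selectors, the `page-<n>.html` URL pattern, the word-to-number rating lookup, and the helper names `parse_listing()` and `scrape_books()` are all assumptions for illustration, not the original script.

```r
library(rvest)
library(tibble)
library(dplyr)

# Assumed: the site encodes ratings as a class such as "star-rating Three"
rating_levels <- c(One = 1L, Two = 2L, Three = 3L, Four = 4L, Five = 5L)

# Extract one row per book from a listing page that has already been parsed
parse_listing <- function(page, base_url) {
  books <- html_elements(page, "article.product_pod")
  link <- html_element(books, "h3 a")
  rating_class <- html_attr(html_element(books, "p.star-rating"), "class")
  rating_word <- sub("star-rating ", "", rating_class, fixed = TRUE)
  price_text <- html_text2(html_element(books, ".price_color"))
  tibble(
    title = html_attr(link, "title"),
    # Strip the currency symbol and keep only digits and the decimal point
    price = as.numeric(gsub("[^0-9.]", "", price_text)),
    rating = unname(rating_levels[rating_word]),
    availability = html_text2(html_element(books, ".availability")),
    book_url = paste0(base_url, html_attr(link, "href"))
  )
}

# Loop over the listing pages, pausing between requests to stay polite
scrape_books <- function(pages = 1:50,
                         base_url = "https://books.toscrape.com/catalogue/") {
  results <- lapply(pages, function(i) {
    Sys.sleep(1)
    parse_listing(read_html(paste0(base_url, "page-", i, ".html")), base_url)
  })
  bind_rows(results)
}
```

Keeping the parsing step separate from the download step means the selectors can be checked against a small saved HTML snippet before any requests are sent to the site.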
books <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/carricoc2_xavier_edu/ERn_s-GCWQ5PmO7Jfhfm2A8BA7k2aSM6vle-ctyLmd5A7w?download=1")
Data Wrangling
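Before analysis, the data needs a few cleaning steps so that it can help answer the proposed question. The code below derives a simple availability status, converts the rating to an integer, and drops any rows with a missing rating or price. The remaining sections are dedicated to analyzing the cleaned dataset.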
books_clean <- books %>%
  mutate(
    availability_status = if_else(str_detect(availability, "In stock"), "In stock", "Out of stock"),
    rating = as.integer(rating)
  ) %>%
  filter(!is.na(rating) & !is.na(price))
summary(books_clean)
    title               price           rating      availability
 Length:1000        Min.   :10.00   Min.   :1.000   Length:1000
 Class :character   1st Qu.:22.11   1st Qu.:2.000   Class :character
 Mode  :character   Median :35.98   Median :3.000   Mode  :character
                    Mean   :35.07   Mean   :2.923
                    3rd Qu.:47.46   3rd Qu.:4.000
                    Max.   :59.99   Max.   :5.000
   book_url         availability_status
 Length:1000        Length:1000
 Class :character   Class :character
 Mode  :character   Mode  :character
Analysis
The first analysis for this assignment looks at the distribution of book ratings across all observations in the data.
books_clean %>%
  ggplot(aes(x = factor(rating))) +
  geom_bar(fill = "purple") +
  labs(title = "Number of Books by Star Rating", x = "Star Rating", y = "Count")
Looking at the results, we see a fairly even distribution across the star ratings. Surprisingly, 1-star ratings are the most common. One possible explanation is that people who dedicate time to reading an entire book and do not like it may leave harsher reviews than they otherwise would.
The next analysis looks at another distribution, this time showing how book prices are distributed within each rating:
books_clean %>%
  ggplot(aes(x = factor(rating), y = price)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Book Price by Star Rating", x = "Star Rating", y = "Price (£)")
Looking at this output, we can see that 3-star books are typically the cheapest of all the star ratings, while 4-star books are the most expensive.
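A boxplot's center line marks the median rather than the mean, so a numeric summary is a useful cross-check on which rating is cheapest "on average". The sketch below computes both statistics per rating. The helper name `price_by_rating()` is hypothetical, and it assumes a data frame with `rating` and `price` columns like `books_clean`.

```r
library(dplyr)

# Mean and median price per star rating, cheapest median first
price_by_rating <- function(df) {
  df %>%
    group_by(rating) %>%
    summarise(
      n_books = n(),
      mean_price = mean(price),
      median_price = median(price),
      .groups = "drop"
    ) %>%
    arrange(median_price)
}
```

Calling `price_by_rating(books_clean)` would return one row per star rating, which makes it easy to see whether the mean and the median tell the same story.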
Continuing with our analysis, I now want to examine the relationship between rating and price:
ggplot(books_clean, aes(x = rating, y = price)) +
  geom_jitter(width = 0.2, alpha = 0.5, color = "darkred") +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "Does Rating Correlate with Price?", x = "Rating", y = "Price (£)")
Looking at this graph, there does not appear to be a strong relationship between price and rating. I expected price to increase with star rating, on the assumption that higher-quality books command higher prices. However, the fitted line shows only a slight increase in price, much flatter than I was expecting.
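The plot alone cannot say whether the relationship is statistically significant. One way to put numbers on it, sketched here under the assumption that a rank correlation suits the ordinal star ratings, is to pair a Spearman test with the slope of the same linear fit that `geom_smooth()` draws. The helper name `price_rating_summary()` is hypothetical.

```r
# Summarise the price-rating relationship for a data frame
# that has numeric `price` and `rating` columns
price_rating_summary <- function(df) {
  # exact = FALSE avoids warnings about ties, which star ratings always have
  test <- cor.test(df$price, df$rating, method = "spearman", exact = FALSE)
  fit <- lm(price ~ rating, data = df)
  list(
    rho = unname(test$estimate),       # rank correlation, between -1 and 1
    p_value = test$p.value,            # evidence against "no association"
    slope = unname(coef(fit)["rating"])  # price change per extra star
  )
}
```

Running `price_rating_summary(books_clean)` would give a correlation, a p-value, and the price change per additional star, which together say more than the fitted line's appearance.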
For our fourth analysis, I want to look at the availability of a book based on how well it is rated:
books_clean %>%
  group_by(rating, availability_status) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(rating), y = count, fill = availability_status)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Availability by Book Rating", x = "Rating", y = "Count")
Looking at this bar chart, I see that 1-star books account for the most out-of-stock titles. I attribute this to one primary factor, though the data cannot confirm it: typically, when a book is not well received, its publisher slows down printing. Copies then leave the shelves faster than they are produced, which causes more frequent stockouts.
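Because the star ratings do not occur equally often, raw counts can make one rating look worse simply because it has more books. Comparing the share of out-of-stock titles within each rating avoids that. This is a minimal sketch with a hypothetical helper name, `out_of_stock_share()`, and it assumes the `availability_status` labels created during data wrangling.

```r
library(dplyr)

# Share of out-of-stock titles within each star rating
out_of_stock_share <- function(df) {
  df %>%
    group_by(rating) %>%
    summarise(
      n_books = n(),
      # The mean of a logical vector is the proportion of TRUE values
      share_out_of_stock = mean(availability_status == "Out of stock"),
      .groups = "drop"
    )
}
```

If the shares from `out_of_stock_share(books_clean)` turn out to be similar across ratings, the pattern in the bar chart would mostly reflect how many books each rating has.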
For our final analysis using this dataset, I want to look into the top 10 most expensive books on this website:
books_clean %>%
  arrange(desc(price)) %>%
  select(title, price, rating) %>%
  head(10)
# A tibble: 10 × 3
   title                                                            price rating
   <chr>                                                            <dbl>  <int>
 1 The Perfect Play (Play by Play #1)                                60.0      3
 2 Last One Home (New Beginnings #1)                                 60.0      3
 3 Civilization and Its Discontents                                  60.0      2
 4 The Barefoot Contessa Cookbook                                    59.9      5
 5 The Diary of a Young Girl                                         59.9      3
 6 The Bone Hunters (Lexy Vaughan & Steven Macaulay #2)              59.7      3
 7 Thomas Jefferson and the Tripoli Pirates: The Forgotten War Tha…  59.6      1
 8 Boar Island (Anna Pigeon #19)                                     59.5      3
 9 The Improbability of Love                                         59.4      1
10 The Man Who Mistook His Wife for a Hat and Other Clinical Tales   59.4      4
Looking at this output, not only do I see books that I have never heard of, but judging by their titles they also span a wide range of genres. I wanted to see whether one genre was more expensive than the rest, but I actually see the opposite: genre might not play a factor in the price of a book. Since the dataset has no genre column, this is only an impression from the titles.
Conclusion
To conclude this analysis, there are several key takeaways to be made:
Ratings are spread fairly evenly across the five star levels, with 1-star books being the most common.
There is no strong linear correlation between price and rating.
Availability is high across all ratings, though more highly rated books appear slightly more available.
The most expensive books appear to span a range of genres, suggesting that genre may not be a major driver of price, but that is yet to be proven since this data has no genre information.