Ethical Web Scraping

using data about books

Author

Carson Carrico

library(tidyverse)
library(ggplot2)
library(readr)

Question

For this assignment, I wanted to look at data on books. Reading is one of my favorite hobbies outside of class, so I wanted to gain some extra insight into the world of books. After some quick research, I found a website that had information on books and openly encouraged web scraping. Using this data, I hope to answer the question of whether there is a relationship between a book’s price, rating, and availability.

Data Collection

Data for this analysis was collected from the Books to Scrape website. I read on an R Reddit page that this website is very beginner-friendly and actively invites scraping newcomers, so I found it to be a perfect fit. I scraped data from 50 pages of the website, using a loop that pulled each book’s title, price, rating, availability, and product URL. The scraping code was saved as a separate R script, and the dataset was saved on my academic OneDrive.

books <- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/carricoc2_xavier_edu/ERn_s-GCWQ5PmO7Jfhfm2A8BA7k2aSM6vle-ctyLmd5A7w?download=1")
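The scraping script itself is not shown in this report. As a rough sketch only, the collection loop described above might look something like the following. The CSS selectors, the `page-%d.html` URL pattern, and the helper names `parse_book_page()` and `scrape_books()` are assumptions made for illustration, not the original code. The one-second pause between requests is one simple way to keep the load on the site low.

```r
library(rvest)
library(tibble)

# Assumed mapping from the site's word-based rating class to an integer.
rating_levels <- c(One = 1L, Two = 2L, Three = 3L, Four = 4L, Five = 5L)

# Parse one listing page (already read with read_html) into a tibble.
# `page_url` is used to turn relative product links into absolute URLs.
parse_book_page <- function(page, page_url) {
  pods <- html_elements(page, "article.product_pod")

  # The rating is stored in a class such as "star-rating Three".
  rating_class <- html_attr(html_element(pods, "p.star-rating"), "class")
  rating_word <- sub("star-rating ", "", rating_class, fixed = TRUE)

  link <- html_element(pods, "h3 a")

  tibble(
    title = html_attr(link, "title"),
    price = readr::parse_number(html_text2(html_element(pods, "p.price_color"))),
    rating = unname(rating_levels[rating_word]),
    availability = html_text2(html_element(pods, "p.availability")),
    book_url = xml2::url_absolute(html_attr(link, "href"), page_url)
  )
}

# Loop over the listing pages, pausing between requests.
scrape_books <- function(pages = 1:50, delay = 1) {
  results <- lapply(pages, function(i) {
    Sys.sleep(delay)
    page_url <- sprintf("https://books.toscrape.com/catalogue/page-%d.html", i)
    parse_book_page(read_html(page_url), page_url)
  })
  dplyr::bind_rows(results)
}
```

With these helpers, `scrape_books()` would return one row per book, which could then be written out with `readr::write_csv()`.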

Data Wrangling

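The scraped data needs a little preparation before it can help answer our proposed question: the availability text is collapsed into a simple in-stock/out-of-stock status, the rating is stored as an integer, and rows with a missing rating or price are dropped. The steps after that are dedicated to analyzing the cleaned dataset.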

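books_clean <- books %>%
  mutate(
    # Collapse the raw availability text into a two-level status
    availability_status = if_else(str_detect(availability, "In stock"), "In stock", "Out of stock"),
    # Store the star rating as an integer (1 to 5)
    rating = as.integer(rating)
  ) %>%
  # Keep only rows that have both a rating and a price
  filter(!is.na(rating) & !is.na(price))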

summary(books_clean)

    title               price           rating      availability      
 Length:1000        Min.   :10.00   Min.   :1.000   Length:1000       
 Class :character   1st Qu.:22.11   1st Qu.:2.000   Class :character  
 Mode  :character   Median :35.98   Median :3.000   Mode  :character  
                    Mean   :35.07   Mean   :2.923                     
                    3rd Qu.:47.46   3rd Qu.:4.000                     
                    Max.   :59.99   Max.   :5.000                     
   book_url         availability_status
 Length:1000        Length:1000        
 Class :character   Class :character   
 Mode  :character   Mode  :character

Analysis

The first analysis looks at the distribution of book ratings across all observations in the data.

books_clean %>%
  ggplot(aes(x = factor(rating))) +
  geom_bar(fill = "purple") +
  labs(title = "Number of Books by Star Rating", x = "Star Rating", y = "Count")

Looking at the results, we see a fairly even distribution across the star ratings. Surprisingly, 1-star ratings are the most common. One possible explanation is that people who dedicate time to reading an entire book and do not like it leave harsher ratings than they otherwise would, though this data cannot confirm that.

The next analysis looks at another distribution, this time showing how book prices are distributed by rating:

books_clean %>%
  ggplot(aes(x = factor(rating), y = price)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Book Price by Star Rating", x = "Star Rating", y = "Price (£)")

Looking at this output, we can see that 3-star books are typically the cheapest of all the star ratings, and 4-star books are typically the most expensive. Note that a box plot marks the median rather than the mean, so these comparisons refer to the median price in each rating group.
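To put numbers next to the box plot, a small summary table can help. The sketch below is one way to do it, assuming a data frame with the `rating` and `price` columns of `books_clean`. The helper name `summarise_price_by_rating()` is made up for this example.

```r
library(dplyr)

# Count, mean price, and median price for each star rating.
summarise_price_by_rating <- function(df) {
  df %>%
    group_by(rating) %>%
    summarise(
      n_books = n(),
      mean_price = mean(price),
      median_price = median(price),
      .groups = "drop"
    )
}
```

Calling `summarise_price_by_rating(books_clean)` would return one row per star rating, making it easy to check whether the mean and the median tell the same story.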

Continuing with our analysis, I now want to examine the relationship between rating and price:

ggplot(books_clean, aes(x = rating, y = price)) +
  geom_jitter(width = 0.2, alpha = 0.5, color = "darkred") +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "Does Rating Correlate with Price?", x = "Rating", y = "Price (£)")

Looking at this graph, there does not appear to be a strong relationship between price and rating. I expected price to increase with star rating, on the reasoning that a higher-quality book should command a higher price. However, the fitted line shows only a slight increase in price, much shallower than I was expecting. Since no formal test was run here, this is a visual impression rather than a statement of statistical significance.
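One way to back up the visual impression is a rank correlation test, which suits a rating that only takes the values 1 to 5. The following is a minimal sketch using base R's `cor.test()`, assuming a data frame with the `rating` and `price` columns of `books_clean`. The helper name `price_rating_cor()` is hypothetical.

```r
# Spearman rank correlation between rating and price.
# exact = FALSE avoids the warning that ties prevent an exact p-value.
price_rating_cor <- function(df) {
  test <- cor.test(df$rating, df$price, method = "spearman", exact = FALSE)
  c(rho = unname(test$estimate), p_value = test$p.value)
}
```

Running `price_rating_cor(books_clean)` would give a correlation coefficient and a p-value. A coefficient close to zero would agree with the nearly flat trend line.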

For our fourth analysis, I want to look at the availability of a book based on how well it is rated:

books_clean %>%
  group_by(rating, availability_status) %>%
  # Drop the grouping after counting so the result is a plain tibble
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = factor(rating), y = count, fill = availability_status)) +
  # geom_col() plots the precomputed counts directly
  geom_col(position = "dodge") +
  labs(title = "Availability by Book Rating", x = "Rating", y = "Count")

Looking at this bar chart, I see that 1-star books account for the most out-of-stock titles. I attribute this to one primary factor, although the data here cannot confirm it. Typically, when a book is not well received, its publisher will slow down printing. If copies leave the shelves faster than they are produced, the book will be out of stock more often.
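Raw counts can be misleading when the rating groups differ in size, so it may be worth comparing the share of out-of-stock books within each rating as well. The sketch below assumes a data frame with the `rating` and `availability_status` columns of `books_clean`. The helper name `availability_share_by_rating()` is made up for this example.

```r
library(dplyr)

# Number and share of out-of-stock books within each star rating.
availability_share_by_rating <- function(df) {
  df %>%
    group_by(rating) %>%
    summarise(
      n_books = n(),
      n_out_of_stock = sum(availability_status == "Out of stock"),
      share_out_of_stock = mean(availability_status == "Out of stock"),
      .groups = "drop"
    )
}
```

With `availability_share_by_rating(books_clean)`, a rating group that is simply larger would no longer look worse just because it contains more books.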

For our final analysis using this dataset, I want to look into the top 10 most expensive books on this website:

books_clean %>%
  arrange(desc(price)) %>%
  select(title, price, rating) %>%
  head(10)

# A tibble: 10 × 3
   title                                                            price rating
   <chr>                                                            <dbl>  <int>
 1 The Perfect Play (Play by Play #1)                                60.0      3
 2 Last One Home (New Beginnings #1)                                 60.0      3
 3 Civilization and Its Discontents                                  60.0      2
 4 The Barefoot Contessa Cookbook                                    59.9      5
 5 The Diary of a Young Girl                                         59.9      3
 6 The Bone Hunters (Lexy Vaughan & Steven Macaulay #2)              59.7      3
 7 Thomas Jefferson and the Tripoli Pirates: The Forgotten War Tha…  59.6      1
 8 Boar Island (Anna Pigeon #19)                                     59.5      3
 9 The Improbability of Love                                         59.4      1
10 The Man Who Mistook His Wife for a Hat and Other Clinical Tales   59.4      4

Looking at this output, not only do I see books that I have never heard of, but judging by their titles they span a wide range of genres. I wanted to see whether one genre was more expensive than the rest, but I actually see the opposite: genre might not play a role in the price of a book. Because genre was not one of the scraped fields, this is only an impression from the titles.
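To test the genre idea properly, the genre would need to be scraped as well. As a hedged sketch, the category could be read from the breadcrumb on each product page. This assumes the breadcrumb has the form Home > Books > category > title and that the selector below matches it. Both the selector and the helper name `parse_book_genre()` are assumptions for illustration.

```r
library(rvest)

# Extract the category (genre) from one parsed product page.
# Assumes the third breadcrumb item links to the book's category.
parse_book_genre <- function(page) {
  page %>%
    html_element("ul.breadcrumb li:nth-child(3) a") %>%
    html_text2()
}
```

Applied to each `book_url` with `read_html()` and a pause between requests, this would add a genre column that could be compared against price.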

Conclusion

To conclude this analysis, there are several key takeaways:

Ratings are spread fairly evenly across the five star levels, with 1-star books the most common.
There is no strong linear correlation between price and rating.
Availability is high across all ratings, though more highly rated books appear slightly more available.
The most expensive books span a range of genres, which suggests that genre may not be a major driver of price, but that is yet to be proven from this data because genre was not scraped.
