[1] "A Light in the Attic"
[2] "Tipping the Velvet"
[3] "Soumission"
[4] "Sharp Objects"
[5] "Sapiens: A Brief History of Humankind"
[6] "The Requiem Red"
# A tibble: 6 × 5
  title                                 rating price stock    image
  <chr>                                 <chr>  <dbl> <chr>    <chr>
1 A Light in the Attic                  Three   51.8 In stock media/cachttps://…
2 Tipping the Velvet                    One     53.7 In stock media/cachttps://…
3 Soumission                            One     50.1 In stock media/cachttps://…
4 Sharp Objects                         Four    47.8 In stock media/cachttps://…
5 Sapiens: A Brief History of Humankind Five    54.2 In stock media/cachttps://…
6 The Requiem Red                       One     22.6 In stock media/cachttps://…
Step 6: Paginate
library(rvest)
library(tibble)

# Empty list to collect each page's data
all_books <- list()

for (i in 1:50) {
  # Page 1 uses the root URL; pages 2+ use /catalogue/page-N.html
  if (i == 1) {
    url <- "https://books.toscrape.com"
  } else {
    url <- paste0("https://books.toscrape.com/catalogue/page-", i, ".html")
  }

  page <- read_html(url)
  books <- html_elements(page, "article.product_pod")

  all_books[[i]] <- tibble(
    title = html_attr(html_element(books, "h3 > a"), "title"),
    rating = gsub("star-rating ", "", html_attr(html_element(books, "p.star-rating"), "class")),
    price = html_text2(html_element(books, "p.price_color")),
    stock = trimws(html_text2(html_element(books, "p.availability"))),
    # Resolve the relative src against the page URL. A plain
    # gsub("../../", ...) treats each "." as a regex wildcard and mangles
    # paths such as "media/cache/...".
    image = xml2::url_absolute(html_attr(html_element(books, "img.thumbnail"), "src"), url)
  )

  # Be polite: pause briefly between requests
  Sys.sleep(0.5)
  cat("Scraped page", i, "\n")
}
# A tibble: 6 × 5
  title                                 rating price stock    image
  <chr>                                 <chr>  <dbl> <chr>    <chr>
1 A Light in the Attic                  Three   51.8 In stock media/cachttps://…
2 Tipping the Velvet                    One     53.7 In stock media/cachttps://…
3 Soumission                            One     50.1 In stock media/cachttps://…
4 Sharp Objects                         Four    47.8 In stock media/cachttps://…
5 Sapiens: A Brief History of Humankind Five    54.2 In stock media/cachttps://…
6 The Requiem Red                       One     22.6 In stock media/cachttps://…
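The plots in the problems that follow use a data frame called full_df1 with index and price_category columns, but the code that builds it is not shown. Here is a minimal sketch of one way it could be derived from all_books. The mock list, the £20 and £40 cut-offs, and the category labels are assumptions for illustration only, not the values used for the plots.

```r
library(dplyr)

# Mock stand-in for the scraped list (made-up titles and prices)
all_books <- list(
  tibble(title = c("Book A", "Book B"), price = c("£12.50", "£35.00")),
  tibble(title = "Book C", price = "£58.25")
)

full_df1 <- bind_rows(all_books) |>
  mutate(
    # Drop everything except digits and the decimal point, then convert
    price = as.numeric(gsub("[^0-9.]", "", price)),
    # Position of each book in scraping order
    index = row_number(),
    # Hypothetical cut-offs: below 20, 20 up to 40, 40 and above
    price_category = cut(
      price,
      breaks = c(-Inf, 20, 40, Inf),
      labels = c("Low", "Medium", "High"),
      right = FALSE
    )
  )

print(full_df1)
```

With the real scrape, the mock list would be replaced by the all_books list filled in by the loop.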
Problem 1: Based on the scraped books_to_scrape data frame, create:
library(ggplot2)

# Scatterplot of price by book index, colored by price category
ggplot(full_df1, aes(x = index, y = price, color = price_category)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dotdash") +
  labs(
    title = "Book Prices Across Dataset by Category",
    x = "Book Index",
    y = "Price (£)",
    color = "Price Category",
    caption = "Data source: books.toscrape.com"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Problem 2: Create a bar graph
library(dplyr)  # for count()

# Tally the number of books in each price category
price_table <- full_df1 |>
  count(price_category)

ggplot(price_table, aes(x = price_category, y = n, fill = price_category)) +
  # geom_col() is equivalent to geom_bar(stat = "identity")
  geom_col() +
  labs(
    title = "Number of Books by Price Category",
    x = "Price Category",
    y = "Number of Books",
    caption = "Data source: books.toscrape.com"
  ) +
  theme_classic() +
  theme(legend.position = "none")
Problem 3: Create one more plot of your own
ggplot(full_df1, aes(x = price_category, y = price, fill = price_category)) +
  geom_boxplot() +
  labs(
    title = "Price Distribution by Category",
    x = "Price Category",
    y = "Price (£)",
    caption = "Data source: books.toscrape.com"
  ) +
  theme_bw()
Explanations
Problem 1:
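This scatterplot shows how book prices vary by book index, with colors representing price categories. Because color is mapped to the discrete price_category variable in the top-level aes(), geom_smooth(method = "lm") fits a separate linear trend line for each category rather than a single overall line.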
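To put numbers on the trend lines that geom_smooth(method = "lm") draws for each color group, the same linear model can be fitted per category with lm(). The sketch below uses made-up values in a mock full_df1; only the column names index, price, and price_category come from the plot code.

```r
# Mock data (made-up values) with the same columns the scatterplot uses
full_df1 <- data.frame(
  index = 1:6,
  price = c(10, 12, 30, 33, 50, 56),
  price_category = rep(c("Low", "Medium", "High"), each = 2)
)

# Slope of price on index within each category, i.e. the change in
# price per one-step increase in book index for each trend line
slopes <- sapply(
  split(full_df1, full_df1$price_category),
  function(d) coef(lm(price ~ index, data = d))[["index"]]
)

print(slopes)
```

A slope near zero would indicate that price does not change systematically with a book's position in the dataset.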
Problem 2:
This bar chart shows the number of books in each price category. It helps visualize how books are distributed across low, medium, and high price ranges.
Problem 3:
This boxplot compares the distribution of prices across the three categories. It shows the median and spread of prices within each group, which makes differences between groups easy to see.
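The medians and spreads that the boxplot displays can also be tabulated directly. Below is a minimal sketch using dplyr's group_by() and summarise() on a mock full_df1 with made-up prices. The summary column names median_price, iqr_price, and n are chosen here for illustration.

```r
library(dplyr)

# Mock data (made-up values) with the columns the boxplot uses
full_df1 <- tibble(
  price = c(10, 14, 18, 30, 32, 40, 50, 55, 58),
  price_category = rep(c("Low", "Medium", "High"), each = 3)
)

# Median, interquartile range, and count of prices per category
price_summary <- full_df1 |>
  group_by(price_category) |>
  summarise(
    median_price = median(price),
    iqr_price = IQR(price),
    n = n()
  )

print(price_summary)
```

The interquartile range corresponds to the height of each box, and the median to the line drawn inside it.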