Tidyverse Extend Assignment

Introduction:

The goal of this assignment is to extend an existing example: taking one of my classmates’ examples (as created in the tidyverse CREATE assignment), I extend it with additional annotated code.

The example I picked to extend is Carol Campbell’s. Carol built her tidyverse example on a dataset that “offers an in-depth look into Amazon’s top 100 Bestselling books along with their customer reviews.” Source: Carol Campbell’s Tidyverse CREATE Assignment

Throughout this .Rmd document, each heading or code chunk is labelled as either Existing Code (EC) or Extended Existing Code (EEC).

Loading Libraries (EC):
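The chunk that produced the startup messages below is not shown in the rendered output. Judging only from those messages (tidyverse 2.0.0 and kableExtra being attached), it was presumably something like the following sketch; the exact calls are an assumption.

```r
# Assumed loading chunk: the original chunk is hidden in the rendered output,
# so these two calls are inferred from the attach messages only.
library(tidyverse)   # attaches dplyr, ggplot2, readr, tidyr, purrr, ...
library(kableExtra)  # table styling helpers used alongside kable()
```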

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'kableExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Import the data (EC):

data <- read.csv("https://raw.githubusercontent.com/carolc57/Data607-Fall23/main/Amazon%20top%20100%20Trending%20Books.csv", header = TRUE, sep = ",")

amazon100 <- as.data.frame(data)

kable(head(amazon100, n = 5))  # show the first 5 rows only

Rank	book.title	book.price	rating	author	year.of.publication	genre	url
1	Iron Flame (The Empyrean, 2)	18.42	4.1	Rebecca Yarros	2023	Fantasy Romance	amazon.com/Iron-Flame-Empyrean-Rebecca-Yarros/dp/1649374178/ref=zg_bs_g_books_sccl_1/143-9831347-1043253?psc=1
2	The Woman in Me	20.93	4.5	Britney Spears	2023	Memoir	amazon.com/Woman-Me-Britney-Spears/dp/1668009048/ref=zg_bs_g_books_sccl_2/143-9831347-1043253?psc=1
3	My Name Is Barbra	31.50	4.5	Barbra Streisand	2023	Autobiography	amazon.com/My-Name-Barbra-Streisand/dp/0525429522/ref=zg_bs_g_books_sccl_3/143-9831347-1043253?psc=1
4	Friends, Lovers, and the Big Terrible Thing: A Memoir	23.99	4.4	Matthew Perry	2023	Memoir	amazon.com/Friends-Lovers-Big-Terrible-Thing/dp/1250866448/ref=zg_bs_g_books_sccl_4/143-9831347-1043253?psc=1
5	How to Catch a Turkey	5.65	4.8	Adam Wallace	2018	Childrens, Fiction	amazon.com/How-Catch-Turkey-Adam-Wallace/dp/1492664359/ref=zg_bs_g_books_sccl_5/143-9831347-1043253?psc=1

Grouping the data (EEC):

Here I combine several functions in a single R chunk. First, I filter by year of publication to focus on the years since Covid began (2020 to 2023). Then I compute the mean book price for each year, so that the prices in 2020 and 2023 can be compared. Finally, I add a column holding the difference in average price between each pair of consecutive years.

Amazon_postCovid <- amazon100 %>%
  filter(year.of.publication %in% 2020:2023) %>%   # keep the post-Covid years only
  group_by(year.of.publication) %>%
  summarise(Average = mean(book.price)) %>%        # mean price per publication year
  mutate(priceGrowth = Average - lag(Average))     # change versus the previous row
head(Amazon_postCovid)

## # A tibble: 4 × 3
##   year.of.publication Average priceGrowth
##                 <int>   <dbl>       <dbl>
## 1                2020    9.93       NA   
## 2                2021   21.1        11.1 
## 3                2022   12.9        -8.16
## 4                2023   17.5         4.64
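To make the role of lag() explicit, here is a minimal sketch that rebuilds the priceGrowth column from the rounded averages printed above. The data frame post_covid and the helper column previous are introduced only for illustration. Because the inputs are rounded, the differences (11.17, -8.2, 4.6) deviate slightly from the table.

```r
library(dplyr)

# Rounded yearly averages copied from the printed table above
post_covid <- data.frame(
  year.of.publication = 2020:2023,
  Average = c(9.93, 21.1, 12.9, 17.5)
)

post_covid <- post_covid %>%
  mutate(
    previous    = lag(Average),        # previous row's average (NA for the first row)
    priceGrowth = Average - previous   # change versus the previous year
  )

# Overall change from the first year to the last year
total_change <- last(post_covid$Average) - first(post_covid$Average)

print(post_covid)
print(total_change)
```

The first row has no predecessor, which is why priceGrowth is NA for 2020.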

Between 2020 and 2023 the average price grew by more than 7 USD. The priceGrowth column shows that this was not a steady climb: the average rose by a little over 11 USD from 2020 to 2021, dropped by about 8 USD from 2021 to 2022, and then rose by over 4 USD in 2023. Let’s apply the same code to the entire dataset:

newAmazon100 <- amazon100 %>%
  group_by(year.of.publication)%>%
  summarise(Average = mean(book.price))%>%
  mutate(priceGrowth = Average-lag(Average))
head(newAmazon100)

## # A tibble: 6 × 3
##   year.of.publication Average priceGrowth
##                 <int>   <dbl>       <dbl>
## 1                1947    5.36      NA    
## 2                1960    6.26       0.900
## 3                1967    8.56       2.30 
## 4                1969    5.99      -2.57 
## 5                1980    4.97      -1.02 
## 6                1982    3.69      -1.28
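One thing to keep in mind when reading this output: the publication years in the full dataset are not contiguous, so lag() compares adjacent rows (for example 1947 with 1960) rather than consecutive calendar years. A possible refinement, sketched here on the first four printed rows, is to divide the change by the number of years between rows. The columns yearGap and growthPerYear are hypothetical names, not part of the original analysis.

```r
library(dplyr)

# First four rows of the printed summary above (rounded values)
by_year <- data.frame(
  year.of.publication = c(1947, 1960, 1967, 1969),
  Average = c(5.36, 6.26, 8.56, 5.99)
)

by_year <- by_year %>%
  arrange(year.of.publication) %>%
  mutate(
    yearGap       = year.of.publication - lag(year.of.publication),  # years between rows
    priceGrowth   = Average - lag(Average),                          # change versus previous row
    growthPerYear = priceGrowth / yearGap                            # average change per year
  )

print(by_year)
```

On the real data, the same mutate() step could be appended to the pipeline that builds newAmazon100.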

Visualize the data (EEC):

To better visualize the price growth, we can plot a line graph:

ggplot(newAmazon100, aes(x=year.of.publication, y=Average)) +
  geom_line()+
  expand_limits(y=0)

Based on the line plot, we can see that before 1998 the average price moved up and down but stayed below 10 USD. The year 2000 has the highest peak of all, with an average price of over 25 USD. After that, the average price fell and rose again, but never reached 25 USD.
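Values read off a plot can be double-checked numerically. The sketch below uses slice_max() from dplyr to pull out the row with the highest average. It runs on a small data frame holding only the yearly averages printed earlier in this document, so its result (2021) is not the peak of the full dataset; the object known_years is made up for illustration.

```r
library(dplyr)

# Only the yearly averages printed earlier in this document (rounded values);
# the full newAmazon100 table contains more years than these.
known_years <- data.frame(
  year.of.publication = c(1947, 1960, 1967, 1969, 1980, 1982,
                          2020, 2021, 2022, 2023),
  Average = c(5.36, 6.26, 8.56, 5.99, 4.97, 3.69,
              9.93, 21.1, 12.9, 17.5)
)

# Row with the highest average price
top_year <- known_years %>%
  slice_max(Average, n = 1)

print(top_year)
```

Running the same slice_max() call on newAmazon100 would be one way to confirm the peak year seen in the line plot.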

Conclusion:

The packages in the tidyverse are very useful for tidying, manipulating, and visualizing data. Their functions can be used individually (one at a time, which makes the document longer), or they can be chained together in just a few R chunks. To become comfortable with chaining them, I need more practice working with R.