The goal of this assignment is to extend an existing example: using one of my classmates' examples (as created in the tidyverse CREATE assignment), I extend it with additional annotated code.
The example I picked to extend is Carol Campbell’s Tidyverse CREATE Assignment. Carol chose to demonstrate tidyverse on a dataset that “offers an in-depth look into Amazon’s top 100 Bestselling books along with their customer reviews.”
Throughout this .RMD document, I note next to each heading or code chunk whether it is Existing Code (EC) or Extended Existing Code (EEC).
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'kableExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## group_rows
# Read the Amazon top 100 trending books CSV from Carol's GitHub repository
data <- read.csv("https://raw.githubusercontent.com/carolc57/Data607-Fall23/main/Amazon%20top%20100%20Trending%20Books.csv", header = TRUE, sep = ",")
amazon100 <- as.data.frame(data)
kable(head(amazon100, n = 5)) # show the first 5 rows only
Rank | book.title | book.price | rating | author | year.of.publication | genre | url |
---|---|---|---|---|---|---|---|
1 | Iron Flame (The Empyrean, 2) | 18.42 | 4.1 | Rebecca Yarros | 2023 | Fantasy Romance | amazon.com/Iron-Flame-Empyrean-Rebecca-Yarros/dp/1649374178/ref=zg_bs_g_books_sccl_1/143-9831347-1043253?psc=1 |
2 | The Woman in Me | 20.93 | 4.5 | Britney Spears | 2023 | Memoir | amazon.com/Woman-Me-Britney-Spears/dp/1668009048/ref=zg_bs_g_books_sccl_2/143-9831347-1043253?psc=1 |
3 | My Name Is Barbra | 31.50 | 4.5 | Barbra Streisand | 2023 | Autobiography | amazon.com/My-Name-Barbra-Streisand/dp/0525429522/ref=zg_bs_g_books_sccl_3/143-9831347-1043253?psc=1 |
4 | Friends, Lovers, and the Big Terrible Thing: A Memoir | 23.99 | 4.4 | Matthew Perry | 2023 | Memoir | amazon.com/Friends-Lovers-Big-Terrible-Thing/dp/1250866448/ref=zg_bs_g_books_sccl_4/143-9831347-1043253?psc=1 |
5 | How to Catch a Turkey | 5.65 | 4.8 | Adam Wallace | 2018 | Childrens, Fiction | amazon.com/How-Catch-Turkey-Adam-Wallace/dp/1492664359/ref=zg_bs_g_books_sccl_5/143-9831347-1043253?psc=1 |
What I tried to do here was combine all the functions in one R chunk. First, I filter by year of publication to focus on the years after Covid began (2020 to 2023). Then, I find the mean price of the books published in each year, so I can compare how much the average price changed between 2020 and 2023. I also add a column with the price difference between each pair of consecutive years.
Amazon_postCovid <- amazon100 %>%
  filter(year.of.publication %in% 2020:2023) %>%   # keep books published 2020 to 2023
  group_by(year.of.publication) %>%
  summarise(Average = mean(book.price)) %>%        # mean price per publication year
  mutate(priceGrowth = Average - lag(Average))     # change from the previous row
head(Amazon_postCovid)
## # A tibble: 4 × 3
## year.of.publication Average priceGrowth
## <int> <dbl> <dbl>
## 1 2020 9.93 NA
## 2 2021 21.1 11.1
## 3 2022 12.9 -8.16
## 4 2023 17.5 4.64
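To make the `lag()` step easier to follow, here is a minimal sketch that rebuilds the `priceGrowth` column from the rounded averages printed above. The objects `post_covid`, `growth`, `total_change` and the extra `previous` column are names I made up for this illustration; because the inputs are already rounded, the differences can be slightly off from the table.

```r
library(dplyr)

# Rounded yearly averages copied from the summary table above
post_covid <- tibble(
  year.of.publication = 2020:2023,
  Average = c(9.93, 21.1, 12.9, 17.5)
)

# lag() shifts Average down by one row, so each row is compared
# with the row before it; the first row has nothing to compare to (NA)
growth <- post_covid %>%
  mutate(
    previous    = lag(Average),
    priceGrowth = Average - previous
  )
print(growth)

# Overall change from the first to the last year in the table
total_change <- last(post_covid$Average) - first(post_covid$Average)
print(total_change)
```

Showing the `previous` column next to `Average` makes it clear where the `NA` in the first row comes from.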
In the 2020 to 2023 summary, the average price grew by more than 7 USD overall (from 9.93 to 17.5). The priceGrowth column shows that the change was not steady: the average rose by a little over 11 USD from 2020 to 2021, dropped by about 8 USD between 2021 and 2022, and then rose by over 4 USD the following year. Let’s apply the same code to the entire dataset:
newAmazon100 <- amazon100 %>%
  group_by(year.of.publication) %>%
  summarise(Average = mean(book.price)) %>%
  mutate(priceGrowth = Average - lag(Average))
head(newAmazon100)
## # A tibble: 6 × 3
## year.of.publication Average priceGrowth
## <int> <dbl> <dbl>
## 1 1947 5.36 NA
## 2 1960 6.26 0.900
## 3 1967 8.56 2.30
## 4 1969 5.99 -2.57
## 5 1980 4.97 -1.02
## 6 1982 3.69 -1.28
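One thing to keep in mind with the full dataset is that `lag()` compares each row with the previous row, not with the previous calendar year, and here the years are not consecutive (1947 is followed by 1960). The sketch below is one possible way to make that visible, using only the six rows printed above. The columns `yearGap` and `growthPerYear` are names I am assuming for the illustration; they are not part of the original code.

```r
library(dplyr)

# First six rows of newAmazon100, copied from the output above
first_rows <- tibble(
  year.of.publication = c(1947L, 1960L, 1967L, 1969L, 1980L, 1982L),
  Average = c(5.36, 6.26, 8.56, 5.99, 4.97, 3.69)
)

gaps <- first_rows %>%
  mutate(
    priceGrowth   = Average - lag(Average),                          # same calculation as before
    yearGap       = year.of.publication - lag(year.of.publication),  # years between adjacent rows
    growthPerYear = priceGrowth / yearGap                            # rough change per calendar year
  )
print(gaps)
```

Dividing by the gap is only a rough adjustment, since each year's average may be based on very few books.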
To better visualize the price growth, we can plot a line graph:
ggplot(newAmazon100, aes(x = year.of.publication, y = Average)) +
  geom_line() +
  expand_limits(y = 0)
Based on the line plot, we can see that before 1998 the average prices go up and down but stay below 10 USD. The year 2000 has the highest peak of all, with an average price over 25 USD. After that, the prices fall and rise again, but never reach 25 USD.
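A reading taken from a plot can also be checked numerically. The sketch below shows one way to do that with `slice_max()` and `all()`. It uses only the six summary rows printed earlier, so its result describes those rows and not the whole dataset; the same calls could be pointed at `newAmazon100` instead. The object names `summary_rows`, `peak` and `under_10` are assumptions for this example.

```r
library(dplyr)

# Only the first six rows of newAmazon100 shown earlier;
# the full summary table contains more years than this
summary_rows <- tibble(
  year.of.publication = c(1947L, 1960L, 1967L, 1969L, 1980L, 1982L),
  Average = c(5.36, 6.26, 8.56, 5.99, 4.97, 3.69)
)

# Year with the highest average price among these rows
peak <- summary_rows %>%
  slice_max(Average, n = 1)
print(peak)

# Do all years before 1998 in these rows stay under 10 USD?
under_10 <- summary_rows %>%
  filter(year.of.publication < 1998) %>%
  summarise(all_under_10 = all(Average < 10))
print(under_10)
```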
All the tools in the libraries of the tidyverse package are very useful for tidying, manipulating, and visualizing data. They can be used individually (one tool at a time, which can make the document longer), or they can be combined in only a few R chunks. To work comfortably with combined tools, I need more practice with R.
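As a small illustration of the two styles, the sketch below runs the same filter, group, and summarise steps once with one tool at a time and once chained with the pipe, then checks that the results match. It uses only the five books shown in the first table; the object names (`books`, `recent`, `by_year`, `stepwise`, `piped`) and the `>= 2020` filter are my own choices for this example.

```r
library(dplyr)

# First five books from the table shown earlier (only the columns needed here)
books <- tibble(
  book.price = c(18.42, 20.93, 31.50, 23.99, 5.65),
  year.of.publication = c(2023L, 2023L, 2023L, 2023L, 2018L)
)

# One tool at a time, saving each intermediate result
recent   <- filter(books, year.of.publication >= 2020)
by_year  <- group_by(recent, year.of.publication)
stepwise <- summarise(by_year, Average = mean(book.price))

# The same steps chained with the pipe in a single statement
piped <- books %>%
  filter(year.of.publication >= 2020) %>%
  group_by(year.of.publication) %>%
  summarise(Average = mean(book.price))

# Both approaches should produce the same summary table
print(isTRUE(all.equal(stepwise, piped)))
```

The piped version avoids the intermediate objects, while the step-by-step version makes it easier to inspect each stage when something looks wrong.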