[Figure: Tidyverse graphic]
The goal of this assignment is to practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use tidyverse functions.
Tidyverse is a collection of R packages that contain tools for transforming and visualizing data. Installing the tidyverse meta-package pulls in roughly 30 packages, but loading it attaches only the core ones:
- ggplot2, for data visualisation
- dplyr, for data manipulation using five powerful verbs: filter, arrange, select, summarize, and mutate
- tidyr, for data tidying, i.e. restructuring data
- readr, for data import
- purrr, for functional programming
- tibble, for tibbles, a modern re-imagining of data frames
- stringr, for strings and fast manipulation thereof
- forcats, for factors
- lubridate, for date/times
To see all the packages included with tidyverse, call tidyverse_packages().
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr 1.1.0 v readr 2.1.4
## v forcats 1.0.0 v stringr 1.5.0
## v ggplot2 3.4.3 v tibble 3.1.8
## v lubridate 1.9.2 v tidyr 1.3.0
## v purrr 1.0.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'kableExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## group_rows
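The startup messages above are what R prints when the packages are attached. A minimal sketch of the calls involved, assuming tidyverse is installed (kableExtra, which the messages also mention, is left out to keep the example small; the names `all_pkgs` and `core_pkgs` are illustrative):

```r
library(tidyverse)  # attaches the core packages listed earlier

# Character vector naming every package the tidyverse meta-package installs
all_pkgs <- tidyverse_packages()
length(all_pkgs)

# The core packages are a subset of that full list
core_pkgs <- c("ggplot2", "dplyr", "tidyr", "readr", "purrr",
               "tibble", "stringr", "forcats", "lubridate")
all(core_pkgs %in% all_pkgs)
```

Comparing `length(all_pkgs)` with `length(core_pkgs)` shows how many installed packages are not attached by default and need their own `library()` call.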
For this vignette, I chose to analyze ‘Amazon’s Top 100 Bestselling Books’.
| Rank | book.title | book.price | rating | author | year.of.publication | genre | url |
|---|---|---|---|---|---|---|---|
| 1 | Iron Flame (The Empyrean, 2) | 18.42 | 4.1 | Rebecca Yarros | 2023 | Fantasy Romance | amazon.com/Iron-Flame-Empyrean-Rebecca-Yarros/dp/1649374178/ref=zg_bs_g_books_sccl_1/143-9831347-1043253?psc=1 |
| 2 | The Woman in Me | 20.93 | 4.5 | Britney Spears | 2023 | Memoir | amazon.com/Woman-Me-Britney-Spears/dp/1668009048/ref=zg_bs_g_books_sccl_2/143-9831347-1043253?psc=1 |
| 3 | My Name Is Barbra | 31.50 | 4.5 | Barbra Streisand | 2023 | Autobiography | amazon.com/My-Name-Barbra-Streisand/dp/0525429522/ref=zg_bs_g_books_sccl_3/143-9831347-1043253?psc=1 |
| 4 | Friends, Lovers, and the Big Terrible Thing: A Memoir | 23.99 | 4.4 | Matthew Perry | 2023 | Memoir | amazon.com/Friends-Lovers-Big-Terrible-Thing/dp/1250866448/ref=zg_bs_g_books_sccl_4/143-9831347-1043253?psc=1 |
| 5 | How to Catch a Turkey | 5.65 | 4.8 | Adam Wallace | 2018 | Childrens, Fiction | amazon.com/How-Catch-Turkey-Adam-Wallace/dp/1492664359/ref=zg_bs_g_books_sccl_5/143-9831347-1043253?psc=1 |
This dataset offers an in-depth look into Amazon’s top 100 Bestselling books along with their customer reviews. Whether you’re a book enthusiast, data scientist, or just curious about the latest literary trends, this dataset provides a window into the world of popular reading.
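The import step is not shown in this vignette, so the following is only a sketch of how it might look with readr. The real file path is not given; the literal text below reproduces three rows of the preview table (without the url column) and stands in for the CSV file. The name `books_csv` is hypothetical, and the dotted column names suggest the original import may have used base `read.csv()` instead.

```r
library(readr)

# Stand-in for the real CSV file (its path is not given in the vignette)
books_csv <- 'Rank,book.title,book.price,rating,author,year.of.publication,genre
1,"Iron Flame (The Empyrean, 2)",18.42,4.1,Rebecca Yarros,2023,Fantasy Romance
2,The Woman in Me,20.93,4.5,Britney Spears,2023,Memoir
5,How to Catch a Turkey,5.65,4.8,Adam Wallace,2018,"Childrens, Fiction"'

# I() tells read_csv() to treat the string as literal data, not a file path
books <- read_csv(I(books_csv), show_col_types = FALSE)
head(books)
```

With a real file, the `I(books_csv)` argument would be replaced by the path to the CSV.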
To begin our analysis, we need to see the composition of the variables, so we'll use the summary() function.
## Rank book.title book.price rating
## Min. : 1.00 Length:100 Min. : 2.780 Min. :4.10
## 1st Qu.: 25.75 Class :character 1st Qu.: 6.303 1st Qu.:4.60
## Median : 50.50 Mode :character Median :11.480 Median :4.70
## Mean : 50.50 Mean :12.709 Mean :4.69
## 3rd Qu.: 75.25 3rd Qu.:16.990 3rd Qu.:4.80
## Max. :100.00 Max. :48.770 Max. :5.00
## NA's :3
## author year.of.publication genre url
## Length:100 Min. :1947 Length:100 Length:100
## Class :character 1st Qu.:2014 Class :character Class :character
## Mode :character Median :2019 Mode :character Mode :character
## Mean :2014
## 3rd Qu.:2023
## Max. :2024
##
summary() shows that our dataset has 100 observations (rows) and 8 variables (columns). Four of the variables are qualitative, reported with class 'character', and four are quantitative. For each quantitative variable, summary() reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. It also shows that rating has 3 missing values (NA's).
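As a small illustration of this inspection step, the sketch below rebuilds the five preview rows by hand (the url column is omitted for brevity) and inspects them. `glimpse()` is added here as an assumption: it is the function that prints the 'chr', 'int', and 'dbl' type labels, which `summary()` itself does not show.

```r
library(tibble)

# Five preview rows from the table above, typed in by hand (url omitted)
books <- tribble(
  ~Rank, ~book.title, ~book.price, ~rating, ~author, ~year.of.publication, ~genre,
  1L, "Iron Flame (The Empyrean, 2)", 18.42, 4.1, "Rebecca Yarros", 2023L, "Fantasy Romance",
  2L, "The Woman in Me", 20.93, 4.5, "Britney Spears", 2023L, "Memoir",
  3L, "My Name Is Barbra", 31.50, 4.5, "Barbra Streisand", 2023L, "Autobiography",
  4L, "Friends, Lovers, and the Big Terrible Thing: A Memoir", 23.99, 4.4, "Matthew Perry", 2023L, "Memoir",
  5L, "How to Catch a Turkey", 5.65, 4.8, "Adam Wallace", 2018L, "Childrens, Fiction"
)

summary(books)  # five-number summary plus mean for numeric columns
glimpse(books)  # one line per column with its type (<int>, <dbl>, <chr>)
```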
The first functions that we will explore are rename() and select(). Some of our columns have long or redundant names, so let's shorten them with rename(). At the same time, we'll drop the url column by leaving it out of the select() call.
| Rank | title | author | price | published | genre | rating |
|---|---|---|---|---|---|---|
| 1 | Iron Flame (The Empyrean, 2) | Rebecca Yarros | 18.42 | 2023 | Fantasy Romance | 4.1 |
| 2 | The Woman in Me | Britney Spears | 20.93 | 2023 | Memoir | 4.5 |
| 3 | My Name Is Barbra | Barbra Streisand | 31.50 | 2023 | Autobiography | 4.5 |
| 4 | Friends, Lovers, and the Big Terrible Thing: A Memoir | Matthew Perry | 23.99 | 2023 | Memoir | 4.4 |
| 5 | How to Catch a Turkey | Adam Wallace | 5.65 | 2018 | Childrens, Fiction | 4.8 |
## Rank title author price
## Min. : 1.00 Length:100 Length:100 Min. : 2.780
## 1st Qu.: 25.75 Class :character Class :character 1st Qu.: 6.303
## Median : 50.50 Mode :character Mode :character Median :11.480
## Mean : 50.50 Mean :12.709
## 3rd Qu.: 75.25 3rd Qu.:16.990
## Max. :100.00 Max. :48.770
##
## published genre rating
## Min. :1947 Length:100 Min. :4.10
## 1st Qu.:2014 Class :character 1st Qu.:4.60
## Median :2019 Mode :character Median :4.70
## Mean :2014 Mean :4.69
## 3rd Qu.:2023 3rd Qu.:4.80
## Max. :2024 Max. :5.00
## NA's :3
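The renamed table and summary above could be produced along these lines. This is a sketch on three of the preview rows, not the original chunk; the url values are replaced by a labeled placeholder, and the name `books_clean` is hypothetical.

```r
library(dplyr)
library(tibble)

books <- tibble(
  Rank = 1:3,
  book.title = c("Iron Flame (The Empyrean, 2)", "The Woman in Me", "My Name Is Barbra"),
  book.price = c(18.42, 20.93, 31.50),
  rating = c(4.1, 4.5, 4.5),
  author = c("Rebecca Yarros", "Britney Spears", "Barbra Streisand"),
  year.of.publication = c(2023L, 2023L, 2023L),
  genre = c("Fantasy Romance", "Memoir", "Autobiography"),
  url = NA_character_  # placeholder; the real URLs are omitted here
)

books_clean <- books %>%
  # rename() takes new_name = old_name pairs
  rename(title = book.title,
         price = book.price,
         published = year.of.publication) %>%
  # select() keeps only the listed columns, in this order; url is dropped
  select(Rank, title, author, price, published, genre, rating)

names(books_clean)
```

An alternative is `select(-url)`, which drops the column without reordering the others.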
Our dataset covers publication years 1947 through 2024, but we're curious about the bestsellers published during the Covid-19 pandemic years of 2020 to 2023, and we only want those whose ratings are higher than 3.5. Since the lowest rating in the data is 4.1, the rating condition mainly excludes the three books with a missing rating. Let's see how we accomplish this…
| Rank | title | author | price | published | genre | rating |
|---|---|---|---|---|---|---|
| 1 | Iron Flame (The Empyrean, 2) | Rebecca Yarros | 18.42 | 2023 | Fantasy Romance | 4.1 |
| 18 | The Exchange: After The Firm (The Firm Series) | John Grisham | 18.96 | 2023 | Literary, Thrillers - Suspense | 4.1 |
| 92 | The Coworker | Freida McFadden | 9.44 | 2023 | Thriller, Mystery | 4.2 |
| 7 | Unwoke: How to Defeat Cultural Marxism in America | Unknown | 27.43 | 2023 | Nonfiction, Politics | 4.3 |
| 64 | Tom Lake: A Reese’s Book Club Pick | Ann Patchett | 15.33 | 2023 | Literary, Medical, Family Life - General, World Literature - India - General | 4.3 |
It's interesting to note that 47 of our original 100 books were published during the Covid-19 years of 2020 to 2023 and have a rating above 3.5.
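A minimal sketch of this filtering step, run on the five preview rows rather than the full dataset. The ascending sort by rating is inferred from the order of the table above; `covid_books` is a hypothetical name.

```r
library(dplyr)
library(tibble)

books_clean <- tribble(
  ~Rank, ~title, ~price, ~published, ~rating,
  1, "Iron Flame (The Empyrean, 2)", 18.42, 2023, 4.1,
  2, "The Woman in Me", 20.93, 2023, 4.5,
  3, "My Name Is Barbra", 31.50, 2023, 4.5,
  4, "Friends, Lovers, and the Big Terrible Thing: A Memoir", 23.99, 2023, 4.4,
  5, "How to Catch a Turkey", 5.65, 2018, 4.8
)

covid_books <- books_clean %>%
  # keep rows where both conditions are TRUE; rows with NA ratings are dropped
  filter(between(published, 2020, 2023), rating > 3.5) %>%
  # sort from lowest to highest rating
  arrange(rating)

covid_books
nrow(covid_books)
```

On the full dataset, `nrow(covid_books)` is the count quoted in the text.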
Let's find the average, minimum, and maximum price by publication year using group_by() and summarise().
## # A tibble: 4 x 4
## published Avgprice Minprice Maxprice
## <int> <dbl> <dbl> <dbl>
## 1 2020 9.93 5.37 15.8
## 2 2021 21.1 10.4 48.8
## 3 2022 12.9 6.93 19.9
## 4 2023 17.5 4.78 31.5
During the Covid-19 pandemic, the average price of books rose from $9.93 for 2020 to $17.54 for 2023, though not steadily: it peaked at about $21.10 for 2021 and dipped to about $12.90 for 2022.
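The grouped summary could be sketched as follows, again on the five preview rows only, so the numbers differ from the full-data table above. The output column names match that table; `price_by_year` is a hypothetical name.

```r
library(dplyr)
library(tibble)

books_clean <- tribble(
  ~title, ~price, ~published,
  "Iron Flame (The Empyrean, 2)", 18.42, 2023,
  "The Woman in Me", 20.93, 2023,
  "My Name Is Barbra", 31.50, 2023,
  "Friends, Lovers, and the Big Terrible Thing: A Memoir", 23.99, 2023,
  "How to Catch a Turkey", 5.65, 2018
)

price_by_year <- books_clean %>%
  group_by(published) %>%          # one group per publication year
  summarise(
    Avgprice = mean(price),        # collapses each group to a single row
    Minprice = min(price),
    Maxprice = max(price)
  )

price_by_year
```

If the price column contained missing values, `na.rm = TRUE` would be needed inside `mean()`, `min()`, and `max()`.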
Let's use mutate() to create a new column, genrerefined, that groups each book as fiction or nonfiction based on user-defined conditions…
Frequency analysis of genrerefined
## # A tibble: 2 x 2
## genre freq
## <chr> <int>
## 1 fiction 63
## 2 nonfiction 37
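The conditions used to build genrerefined are not shown in this vignette, so the rule below is purely an assumption for illustration: a genre that mentions nonfiction, memoir, or autobiography is labeled nonfiction, and everything else is labeled fiction. It runs on the five preview genres only.

```r
library(dplyr)
library(stringr)
library(tibble)

books_clean <- tibble(
  genre = c("Fantasy Romance", "Memoir", "Autobiography",
            "Memoir", "Childrens, Fiction")
)

# Hypothetical pattern; the vignette's actual conditions are not shown
nonfiction_pattern <- regex("nonfiction|memoir|autobiography", ignore_case = TRUE)

genre_freq <- books_clean %>%
  # mutate() adds a column computed from existing ones
  mutate(genrerefined = if_else(str_detect(genre, nonfiction_pattern),
                                "nonfiction", "fiction")) %>%
  # count() tallies the rows per category into a column named freq
  count(genrerefined, name = "freq")

genre_freq
```

For more than two categories, `case_when()` is the usual replacement for `if_else()`.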
The frequency table shows that there are 63 fiction and 37 nonfiction books in our dataset. Let's graph them using ggplot2.
ggplot2 allows us to visualize our data in a variety of ways. Here we will use a simple bar plot to illustrate our genre findings.
Here we can clearly see that fiction-like books outnumber nonfiction-like books by roughly 1.7 to 1 (63 to 37).
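The plot itself is not reproduced here; a bar plot of the two counts reported above could be sketched like this. The axis labels and title are illustrative choices, not taken from the original figure.

```r
library(ggplot2)

# Counts taken from the frequency table in the text
genre_freq <- data.frame(
  genre = c("fiction", "nonfiction"),
  freq = c(63L, 37L)
)

# geom_col() draws bars whose heights are the values in freq
p <- ggplot(genre_freq, aes(x = genre, y = freq)) +
  geom_col() +
  labs(x = "Genre", y = "Number of books",
       title = "Fiction vs. nonfiction among the top 100")

p  # printing the object draws the plot
```

`geom_col()` fits here because the counts are already computed; `geom_bar()` would instead count rows of the unsummarised data.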
Tidyverse contains a myriad of tools that allow us to easily analyze data, find insights, and see correlations. Other possibilities, not illustrated here due to time constraints, include linear regressions and other statistical functions.