[Figure: Tidyverse graphic]

Introduction:

The goal of this assignment is to practice collaborating around a code project with GitHub. You can think of our collective work as building a book of examples on how to use tidyverse functions.

What is Tidyverse?

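The tidyverse is a collection of R packages that share a common design and provide tools for importing, transforming, and visualizing data. Installing the tidyverse meta-package brings in roughly 30 packages, but library(tidyverse) attaches only the nine core ones: dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble, and tidyr.

To see all the packages included with the tidyverse, call tidyverse_packages().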

Load libraries

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.0     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.0
## v ggplot2   3.4.3     v tibble    3.1.8
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.2     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'kableExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
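
The startup messages above are what R prints when the tidyverse and kableExtra are attached. The original chunk is not shown, so the following is only a minimal sketch of what it might look like; loading kableExtra conditionally is an assumption made so the snippet also runs where that package is not installed.

```r
# Attach the core tidyverse packages (dplyr, ggplot2, tidyr, ...)
library(tidyverse)

# kableExtra is used for table formatting; attach it only if it is installed
if (requireNamespace("kableExtra", quietly = TRUE)) {
  library(kableExtra)
}

# List every package that belongs to the tidyverse
pkgs <- tidyverse_packages()
print(pkgs)

# The nine packages named in the attach message above
core <- c("dplyr", "forcats", "ggplot2", "lubridate", "purrr",
          "readr", "stringr", "tibble", "tidyr")

# Each core package should appear in the full list
print(core %in% pkgs)
```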

Import data

For this vignette, I chose to analyze ‘Amazon’s Top 100 Bestselling Books’.

| Rank | book.title | book.price | rating | author | year.of.publication | genre | url |
|---|---|---|---|---|---|---|---|
| 1 | Iron Flame (The Empyrean, 2) | 18.42 | 4.1 | Rebecca Yarros | 2023 | Fantasy Romance | amazon.com/Iron-Flame-Empyrean-Rebecca-Yarros/dp/1649374178/ref=zg_bs_g_books_sccl_1/143-9831347-1043253?psc=1 |
| 2 | The Woman in Me | 20.93 | 4.5 | Britney Spears | 2023 | Memoir | amazon.com/Woman-Me-Britney-Spears/dp/1668009048/ref=zg_bs_g_books_sccl_2/143-9831347-1043253?psc=1 |
| 3 | My Name Is Barbra | 31.50 | 4.5 | Barbra Streisand | 2023 | Autobiography | amazon.com/My-Name-Barbra-Streisand/dp/0525429522/ref=zg_bs_g_books_sccl_3/143-9831347-1043253?psc=1 |
| 4 | Friends, Lovers, and the Big Terrible Thing: A Memoir | 23.99 | 4.4 | Matthew Perry | 2023 | Memoir | amazon.com/Friends-Lovers-Big-Terrible-Thing/dp/1250866448/ref=zg_bs_g_books_sccl_4/143-9831347-1043253?psc=1 |
| 5 | How to Catch a Turkey | 5.65 | 4.8 | Adam Wallace | 2018 | Childrens, Fiction | amazon.com/How-Catch-Turkey-Adam-Wallace/dp/1492664359/ref=zg_bs_g_books_sccl_5/143-9831347-1043253?psc=1 |
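
The import chunk itself is not shown. The dotted column names (book.title, year.of.publication) suggest base R's read.csv(), which rewrites spaces in headers as dots, though that is an inference rather than something the source confirms. The sketch below reads the five preview rows from inline text so that it runs without the original file; the header with spaces is an assumption about the raw CSV, and the URLs are shortened to the product path for readability.

```r
# Five rows copied from the preview above, written as CSV lines.
# ASSUMPTION: the raw header uses spaces, which read.csv() turns into dots.
csv_lines <- c(
  'Rank,book title,book price,rating,author,year of publication,genre,url',
  '1,"Iron Flame (The Empyrean, 2)",18.42,4.1,Rebecca Yarros,2023,Fantasy Romance,amazon.com/Iron-Flame-Empyrean-Rebecca-Yarros/dp/1649374178',
  '2,The Woman in Me,20.93,4.5,Britney Spears,2023,Memoir,amazon.com/Woman-Me-Britney-Spears/dp/1668009048',
  '3,My Name Is Barbra,31.50,4.5,Barbra Streisand,2023,Autobiography,amazon.com/My-Name-Barbra-Streisand/dp/0525429522',
  '4,"Friends, Lovers, and the Big Terrible Thing: A Memoir",23.99,4.4,Matthew Perry,2023,Memoir,amazon.com/Friends-Lovers-Big-Terrible-Thing/dp/1250866448',
  '5,How to Catch a Turkey,5.65,4.8,Adam Wallace,2018,"Childrens, Fiction",amazon.com/How-Catch-Turkey-Adam-Wallace/dp/1492664359'
)

# With a real file, pass its path as the first argument instead of text =
books <- read.csv(text = csv_lines, stringsAsFactors = FALSE)

# Preview the first rows and confirm the column names
head(books, 5)
names(books)
```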

About the Amazon Top 100 Bestselling Books dataset

This dataset offers an in-depth look into Amazon’s top 100 Bestselling books along with their customer reviews. Whether you’re a book enthusiast, data scientist, or just curious about the latest literary trends, this dataset provides a window into the world of popular reading.

To begin our analysis, we need to see the type and distribution of each variable, so we'll start with summary().

##       Rank         book.title          book.price         rating    
##  Min.   :  1.00   Length:100         Min.   : 2.780   Min.   :4.10  
##  1st Qu.: 25.75   Class :character   1st Qu.: 6.303   1st Qu.:4.60  
##  Median : 50.50   Mode  :character   Median :11.480   Median :4.70  
##  Mean   : 50.50                      Mean   :12.709   Mean   :4.69  
##  3rd Qu.: 75.25                      3rd Qu.:16.990   3rd Qu.:4.80  
##  Max.   :100.00                      Max.   :48.770   Max.   :5.00  
##                                                       NA's   :3     
##     author          year.of.publication    genre               url           
##  Length:100         Min.   :1947        Length:100         Length:100        
##  Class :character   1st Qu.:2014        Class :character   Class :character  
##  Mode  :character   Median :2019        Mode  :character   Mode  :character  
##                     Mean   :2014                                             
##                     3rd Qu.:2023                                             
##                     Max.   :2024                                             
## 

The output shows that our dataset has 100 observations (rows) and 8 variables (columns). Four of the variables are qualitative (book.title, author, genre, and url, shown with class character, or 'chr' in glimpse() output) and four are quantitative (Rank, book.price, rating, and year.of.publication, stored as 'int' or 'dbl'). For each quantitative variable, summary() reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. It also shows that rating has 3 missing values (NA's).
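
As a rough sketch of this inspection step, the snippet below runs summary() and glimpse() on the five preview rows (the url column is left out for brevity). Using glimpse() here is an assumption; it is one way to see the 'chr', 'int', and 'dbl' labels mentioned above. The numbers will differ from the full 100-row output.

```r
library(tidyverse)

# Five preview rows from the dataset (url column omitted for brevity)
books <- tribble(
  ~Rank, ~book.title, ~book.price, ~rating, ~author, ~year.of.publication, ~genre,
  1L, "Iron Flame (The Empyrean, 2)", 18.42, 4.1, "Rebecca Yarros", 2023L, "Fantasy Romance",
  2L, "The Woman in Me", 20.93, 4.5, "Britney Spears", 2023L, "Memoir",
  3L, "My Name Is Barbra", 31.50, 4.5, "Barbra Streisand", 2023L, "Autobiography",
  4L, "Friends, Lovers, and the Big Terrible Thing: A Memoir", 23.99, 4.4, "Matthew Perry", 2023L, "Memoir",
  5L, "How to Catch a Turkey", 5.65, 4.8, "Adam Wallace", 2018L, "Childrens, Fiction"
)

# Five-number summary plus mean for numeric columns, class info for character ones
summary(books)

# One line per column with its type (chr, int, dbl) and first values
glimpse(books)

# Count the quantitative and qualitative columns
n_numeric <- sum(map_lgl(books, is.numeric))
n_character <- sum(map_lgl(books, is.character))
cat("numeric:", n_numeric, " character:", n_character, "\n")
```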

Using dplyr statements to slice and dice our dataframe

The first functions we will explore are rename() and select(). We will rename some columns with rename() and drop others by omitting them from select().

Renaming some columns, dropping others…

Some of our columns have long or redundant names, so let's change them with rename(). At the same time, we'll drop the url column.

| Rank | title | author | price | published | genre | rating |
|---|---|---|---|---|---|---|
| 1 | Iron Flame (The Empyrean, 2) | Rebecca Yarros | 18.42 | 2023 | Fantasy Romance | 4.1 |
| 2 | The Woman in Me | Britney Spears | 20.93 | 2023 | Memoir | 4.5 |
| 3 | My Name Is Barbra | Barbra Streisand | 31.50 | 2023 | Autobiography | 4.5 |
| 4 | Friends, Lovers, and the Big Terrible Thing: A Memoir | Matthew Perry | 23.99 | 2023 | Memoir | 4.4 |
| 5 | How to Catch a Turkey | Adam Wallace | 5.65 | 2018 | Childrens, Fiction | 4.8 |

##       Rank           title              author              price       
##  Min.   :  1.00   Length:100         Length:100         Min.   : 2.780  
##  1st Qu.: 25.75   Class :character   Class :character   1st Qu.: 6.303  
##  Median : 50.50   Mode  :character   Mode  :character   Median :11.480  
##  Mean   : 50.50                                         Mean   :12.709  
##  3rd Qu.: 75.25                                         3rd Qu.:16.990  
##  Max.   :100.00                                         Max.   :48.770  
##                                                                         
##    published       genre               rating    
##  Min.   :1947   Length:100         Min.   :4.10  
##  1st Qu.:2014   Class :character   1st Qu.:4.60  
##  Median :2019   Mode  :character   Median :4.70  
##  Mean   :2014                      Mean   :4.69  
##  3rd Qu.:2023                      3rd Qu.:4.80  
##  Max.   :2024                      Max.   :5.00  
##                                    NA's   :3
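
The renamed and reordered columns shown above could come from a pipeline like the following. This is a sketch on three preview rows; the object name books_clean is hypothetical, and the URLs are shortened to the product path.

```r
library(tidyverse)

# Three preview rows with the original column names (URLs shortened)
books <- tribble(
  ~Rank, ~book.title, ~book.price, ~rating, ~author, ~year.of.publication, ~genre, ~url,
  1L, "Iron Flame (The Empyrean, 2)", 18.42, 4.1, "Rebecca Yarros", 2023L, "Fantasy Romance",
    "amazon.com/Iron-Flame-Empyrean-Rebecca-Yarros/dp/1649374178",
  2L, "The Woman in Me", 20.93, 4.5, "Britney Spears", 2023L, "Memoir",
    "amazon.com/Woman-Me-Britney-Spears/dp/1668009048",
  3L, "My Name Is Barbra", 31.50, 4.5, "Barbra Streisand", 2023L, "Autobiography",
    "amazon.com/My-Name-Barbra-Streisand/dp/0525429522"
)

books_clean <- books %>%
  # rename() takes new_name = old_name pairs
  rename(title = book.title,
         price = book.price,
         published = year.of.publication) %>%
  # select() keeps only the listed columns, in this order; url is dropped by omission
  select(Rank, title, author, price, published, genre, rating)

names(books_clean)
```

An equivalent way to drop a single column is select(-url), which keeps the remaining columns in their existing order.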

Inquiring minds want to know…

Our dataset covers publication years 1947 through 2024, but we're curious about the bestsellers published during the COVID-19 pandemic years, 2020 to 2023, and we only want those with ratings higher than 3.5. Because the lowest rating in the dataset is 4.1, the rating condition in practice only removes books with a missing rating. Let's see how to accomplish this…

filter() and arrange()

| Rank | title | author | price | published | genre | rating |
|---|---|---|---|---|---|---|
| 1 | Iron Flame (The Empyrean, 2) | Rebecca Yarros | 18.42 | 2023 | Fantasy Romance | 4.1 |
| 18 | The Exchange: After The Firm (The Firm Series) | John Grisham | 18.96 | 2023 | Literary, Thrillers - Suspense | 4.1 |
| 92 | The Coworker | Freida McFadden | 9.44 | 2023 | Thriller, Mystery | 4.2 |
| 7 | Unwoke: How to Defeat Cultural Marxism in America | Unknown | 27.43 | 2023 | Nonfiction, Politics | 4.3 |
| 64 | Tom Lake: A Reese’s Book Club Pick | Ann Patchett | 15.33 | 2023 | Literary, Medical, Family Life - General, World Literature - India - General | 4.3 |
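
A sketch of the filter() and arrange() step, using the nine rows that appear in this vignette. The sort order (newest publication year first, then lowest rating, then Rank as a tie-breaker) is an assumption inferred from the displayed rows; the original chunk is not shown.

```r
library(tidyverse)

# The nine rows visible in this vignette (columns reduced to those needed here)
books <- tribble(
  ~Rank, ~title, ~published, ~rating,
  1L,  "Iron Flame (The Empyrean, 2)", 2023L, 4.1,
  2L,  "The Woman in Me", 2023L, 4.5,
  3L,  "My Name Is Barbra", 2023L, 4.5,
  4L,  "Friends, Lovers, and the Big Terrible Thing: A Memoir", 2023L, 4.4,
  5L,  "How to Catch a Turkey", 2018L, 4.8,
  7L,  "Unwoke: How to Defeat Cultural Marxism in America", 2023L, 4.3,
  18L, "The Exchange: After The Firm (The Firm Series)", 2023L, 4.1,
  64L, "Tom Lake: A Reese's Book Club Pick", 2023L, 4.3,
  92L, "The Coworker", 2023L, 4.2
)

covid_books <- books %>%
  # Keep books published 2020-2023 with a rating above 3.5;
  # rows where rating is NA are dropped because the condition is not TRUE
  filter(between(published, 2020, 2023), rating > 3.5) %>%
  # ASSUMED ordering: newest year first, then ascending rating, then Rank
  arrange(desc(published), rating, Rank)

head(covid_books, 5)
```

On these sample rows, the first five results have ranks 1, 18, 92, 7, and 64, matching the order of the table above.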

It’s interesting to note that 47 of our original 100 observations are books published during the COVID-19 years of 2020 to 2023.

Grouping..

Let’s find the average, minimum, and maximum price by publication year using group_by() and summarise().

## # A tibble: 4 x 4
##   published Avgprice Minprice Maxprice
##       <int>    <dbl>    <dbl>    <dbl>
## 1      2020     9.93     5.37     15.8
## 2      2021    21.1     10.4      48.8
## 3      2022    12.9      6.93     19.9
## 4      2023    17.5      4.78     31.5
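
A minimal sketch of the grouping step, again on the nine rows visible in this vignette. The summary column names follow the output above; the sample includes one 2018 title so that two groups appear, and the resulting numbers will not match the full-dataset table.

```r
library(tidyverse)

# Rank, price, and publication year for the nine rows shown in this vignette
books <- tribble(
  ~Rank, ~price, ~published,
  1L,  18.42, 2023L,
  2L,  20.93, 2023L,
  3L,  31.50, 2023L,
  4L,  23.99, 2023L,
  5L,   5.65, 2018L,
  7L,  27.43, 2023L,
  18L, 18.96, 2023L,
  64L, 15.33, 2023L,
  92L,  9.44, 2023L
)

price_by_year <- books %>%
  # One group per publication year
  group_by(published) %>%
  # Collapse each group to a single row of price statistics
  summarise(Avgprice = mean(price),
            Minprice = min(price),
            Maxprice = max(price))

price_by_year
```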

Among bestsellers published during the COVID-19 pandemic, the average price was $9.93 for 2020 titles and $17.54 for 2023 titles, with a peak of about $21.1 for 2021 titles.

Let’s use mutate() to add a new column, genrerefined, that groups each book into a broader category based on user-defined conditions…
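
The conditions used to build genrerefined are not shown, so the rule below is purely hypothetical: a book is labeled nonfiction when its genre mentions nonfiction, memoir, or autobiography, and fiction otherwise. It is meant only to illustrate mutate() with case_when() on the genres visible in this vignette.

```r
library(tidyverse)

# Titles and genres for the nine rows shown in this vignette
books <- tribble(
  ~title, ~genre,
  "Iron Flame (The Empyrean, 2)", "Fantasy Romance",
  "The Woman in Me", "Memoir",
  "My Name Is Barbra", "Autobiography",
  "Friends, Lovers, and the Big Terrible Thing: A Memoir", "Memoir",
  "How to Catch a Turkey", "Childrens, Fiction",
  "Unwoke: How to Defeat Cultural Marxism in America", "Nonfiction, Politics",
  "The Exchange: After The Firm (The Firm Series)", "Literary, Thrillers - Suspense",
  "Tom Lake: A Reese's Book Club Pick",
    "Literary, Medical, Family Life - General, World Literature - India - General",
  "The Coworker", "Thriller, Mystery"
)

# HYPOTHETICAL rule: these keywords are an assumption, not the author's definition
nonfiction_pattern <- regex("nonfiction|memoir|autobiography", ignore_case = TRUE)

books <- books %>%
  mutate(genrerefined = case_when(
    # Check the nonfiction keywords first, since "Nonfiction" contains "fiction"
    str_detect(genre, nonfiction_pattern) ~ "nonfiction",
    # Everything else falls through to fiction
    TRUE ~ "fiction"
  ))

books %>% select(title, genre, genrerefined)
```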

Using summarise(), gather(), and arrange() to group data

Frequency analysis of genrerefined

## # A tibble: 2 x 2
##   genre       freq
##   <chr>      <int>
## 1 fiction       63
## 2 nonfiction    37
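
One way a frequency table of this shape could be built with summarise(), gather(), and arrange() is sketched below. The genrerefined labels come from a hypothetical keyword rule applied to the nine genres visible in this vignette, so the counts differ from the 63 and 37 reported for the full dataset.

```r
library(tidyverse)

# Genres of the nine rows shown in this vignette
books <- tibble(genre = c(
  "Fantasy Romance", "Memoir", "Autobiography", "Memoir", "Childrens, Fiction",
  "Nonfiction, Politics", "Literary, Thrillers - Suspense",
  "Literary, Medical, Family Life - General, World Literature - India - General",
  "Thriller, Mystery"
)) %>%
  # HYPOTHETICAL rule for the refined genre (not the author's definition)
  mutate(genrerefined = if_else(
    str_detect(genre, regex("nonfiction|memoir|autobiography", ignore_case = TRUE)),
    "nonfiction", "fiction"
  ))

genre_freq <- books %>%
  # One column per category, each holding a count
  summarise(fiction = sum(genrerefined == "fiction"),
            nonfiction = sum(genrerefined == "nonfiction")) %>%
  # Reshape the one-row wide table into genre/freq pairs
  gather(key = "genre", value = "freq") %>%
  # Most frequent category first
  arrange(desc(freq))

genre_freq
```

gather() is superseded in current tidyr by pivot_longer(), and count(genrerefined) gives a similar table in a single step; gather() is used here only because the heading names it.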

Here we see that there are 63 fiction and 37 nonfiction books in our dataset. Let’s graph them using ggplot2.

ggplot2

ggplot2 allows us to visualize our data in a variety of ways. Here we will use a simple bar chart to illustrate our genre findings.

Here we can clearly see that fiction-like books outnumber nonfiction-like books by roughly 1.7 to 1 (63 versus 37).
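
The chart itself is not reproduced here. A bar chart of the frequency table could be drawn along the following lines; the title, axis labels, and fill choice are assumptions, and the counts are taken directly from the table above.

```r
library(ggplot2)

# Frequency table as reported above
genre_freq <- data.frame(
  genre = c("fiction", "nonfiction"),
  freq = c(63L, 37L)
)

# geom_col() draws bars whose heights come from the freq column
p <- ggplot(genre_freq, aes(x = genre, y = freq, fill = genre)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Books by refined genre",
       x = "Genre",
       y = "Number of books")

print(p)

# Ratio of fiction to nonfiction titles
ratio <- genre_freq$freq[1] / genre_freq$freq[2]
round(ratio, 2)
```

geom_col() is used instead of geom_bar() because the counts are already computed; geom_bar() would count rows itself.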

Conclusion:

The tidyverse contains a wide range of tools that let us easily transform, summarise, and visualize data to find insights. Other possibilities include linear regression and other statistical functions (not illustrated here due to time constraints).