Data 607 - Tidyverse CREATE pt1

Introduction:

The goal of this assignment is to practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use tidyverse functions.

What is Tidyverse?

Tidyverse is a collection of R packages that contain tools for importing, transforming and visualizing data. The tidyverse package currently bundles about 30 packages; the core packages attached by library(tidyverse) are:

ggplot2, for data visualisation
dplyr, for data manipulation using 5 powerful verbs: filter, arrange, select, summarize, and mutate
tidyr, for data tidying, i.e. restructuring data
readr, for data import
purrr, for functional programming
tibble, for tibbles, a modern re-imagining of data frames
stringr, for strings and fast manipulation thereof
forcats, for factors
lubridate, for date/times

To see all the packages included with tidyverse use “tidyverse_packages()”.
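As a quick check of the list above, the following sketch prints the names of the bundled packages. The variable name all_pkgs is only illustrative, and the exact count depends on the installed tidyverse version.

```r
library(tidyverse)

# tidyverse_packages() returns a character vector of package names
all_pkgs <- tidyverse_packages()

length(all_pkgs)   # how many packages the installed version bundles
head(all_pkgs)     # first few names
```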

Load libraries

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.0     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.0
## v ggplot2   3.4.3     v tibble    3.1.8
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.2     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'kableExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
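The startup messages above are the kind of output produced by attaching the two packages they name. A minimal sketch, assuming both tidyverse and kableExtra are installed:

```r
library(tidyverse)    # attaches the core packages and reports conflicts
library(kableExtra)   # table styling; its group_rows() masks dplyr's

# List the function-name conflicts again at any time
tidyverse_conflicts()
```

When a masked function is needed, it can still be called with an explicit namespace, for example stats::filter().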

Import data

For this vignette, I chose to analyze ‘Amazon’s Top 100 Bestselling Books’.

Rank	book.title	book.price	rating	author	year.of.publication	genre	url
1	Iron Flame (The Empyrean, 2)	18.42	4.1	Rebecca Yarros	2023	Fantasy Romance	amazon.com/Iron-Flame-Empyrean-Rebecca-Yarros/dp/1649374178/ref=zg_bs_g_books_sccl_1/143-9831347-1043253?psc=1
2	The Woman in Me	20.93	4.5	Britney Spears	2023	Memoir	amazon.com/Woman-Me-Britney-Spears/dp/1668009048/ref=zg_bs_g_books_sccl_2/143-9831347-1043253?psc=1
3	My Name Is Barbra	31.50	4.5	Barbra Streisand	2023	Autobiography	amazon.com/My-Name-Barbra-Streisand/dp/0525429522/ref=zg_bs_g_books_sccl_3/143-9831347-1043253?psc=1
4	Friends, Lovers, and the Big Terrible Thing: A Memoir	23.99	4.4	Matthew Perry	2023	Memoir	amazon.com/Friends-Lovers-Big-Terrible-Thing/dp/1250866448/ref=zg_bs_g_books_sccl_4/143-9831347-1043253?psc=1
5	How to Catch a Turkey	5.65	4.8	Adam Wallace	2018	Childrens, Fiction	amazon.com/How-Catch-Turkey-Adam-Wallace/dp/1492664359/ref=zg_bs_g_books_sccl_5/143-9831347-1043253?psc=1
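The import code is not shown in this write-up and the file name is unknown, so the sketch below reads a few of the rows above from inline text with readr instead of from a file. The url column is left out for brevity, and the name csv_text is only illustrative. With a real file, the path would replace I(csv_text).

```r
library(tidyverse)

# Three rows copied from the table above; titles containing commas are quoted
csv_text <- 'Rank,book.title,book.price,rating,author,year.of.publication,genre
1,"Iron Flame (The Empyrean, 2)",18.42,4.1,Rebecca Yarros,2023,Fantasy Romance
2,The Woman in Me,20.93,4.5,Britney Spears,2023,Memoir
3,My Name Is Barbra,31.50,4.5,Barbra Streisand,2023,Autobiography
'

# I() tells read_csv() to treat the string as literal data, not as a path
books <- read_csv(I(csv_text), show_col_types = FALSE)
head(books)
```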

About the Amazon Top 100 Bestselling Books dataset…

This dataset offers an in-depth look into Amazon’s top 100 Bestselling books along with their customer reviews. Whether you’re a book enthusiast, data scientist, or just curious about the latest literary trends, this dataset provides a window into the world of popular reading.

Book Rank: The ranking of the book among the top 100 Bestselling books on Amazon.
Book Title: The title of the book.
Price: The price of the book in USD.
Rating: The overall rating of the book, on a scale of 1 to 5.
Author: The author of the book.
Year of Publication: The year in which the book was published.
Genre: The genre or category to which the book belongs.
URL: The URL link to the book on Amazon’s platform.
Review Title: The title of the book review.
Reviewer: The name of the person who has written a review for the book.
Reviewer Rating: The rating given by the reviewer for the book, on a scale of 1 to 5.
Review Description: The text description of the review given.
Is_verified: Indicates whether the review is verified as a genuine customer review.
Date: The date when the review was posted.
Timestamp: The timestamp indicating when the review was posted.
ASIN: Amazon Standard Identification Number assigned to products on Amazon.

To begin our analysis we need to see the composition of the variables, so we’ll use the summary() function.

##       Rank         book.title          book.price         rating    
##  Min.   :  1.00   Length:100         Min.   : 2.780   Min.   :4.10  
##  1st Qu.: 25.75   Class :character   1st Qu.: 6.303   1st Qu.:4.60  
##  Median : 50.50   Mode  :character   Median :11.480   Median :4.70  
##  Mean   : 50.50                      Mean   :12.709   Mean   :4.69  
##  3rd Qu.: 75.25                      3rd Qu.:16.990   3rd Qu.:4.80  
##  Max.   :100.00                      Max.   :48.770   Max.   :5.00  
##                                                       NA's   :3     
##     author          year.of.publication    genre               url           
##  Length:100         Min.   :1947        Length:100         Length:100        
##  Class :character   1st Qu.:2014        Class :character   Class :character  
##  Mode  :character   Median :2019        Mode  :character   Mode  :character  
##                     Mean   :2014                                             
##                     3rd Qu.:2023                                             
##                     Max.   :2024                                             
##

summary() shows that our dataset has 8 variables (columns), and the Length:100 entries indicate 100 observations (rows). Four of the variables are qualitative, reported with Class ‘character’ (shown as ‘chr’ by str() or glimpse()), and four are quantitative (shown as ‘int’ or ‘dbl’). For each quantitative variable, summary() provides the minimum, 1st quartile, median, mean, 3rd quartile and maximum; it also reports that rating has 3 missing values (NA’s).
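A small sketch of this step, using five rows from the table shown earlier rather than the full file (the url column is omitted for brevity). glimpse() is added here as a complement because it prints the ‘chr’/‘dbl’ type abbreviations that summary() does not.

```r
library(tidyverse)

# Five sample rows from the dataset, typed in by hand
books <- tribble(
  ~Rank, ~book.title, ~book.price, ~rating, ~author, ~year.of.publication, ~genre,
  1, "Iron Flame (The Empyrean, 2)", 18.42, 4.1, "Rebecca Yarros", 2023, "Fantasy Romance",
  2, "The Woman in Me", 20.93, 4.5, "Britney Spears", 2023, "Memoir",
  3, "My Name Is Barbra", 31.50, 4.5, "Barbra Streisand", 2023, "Autobiography",
  4, "Friends, Lovers, and the Big Terrible Thing: A Memoir", 23.99, 4.4, "Matthew Perry", 2023, "Memoir",
  5, "How to Catch a Turkey", 5.65, 4.8, "Adam Wallace", 2018, "Childrens, Fiction"
)

summary(books)          # quartiles, mean and NA counts per column
glimpse(books)          # one line per column with its type
sum(is.na(books$rating))  # explicit count of missing ratings
```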

Using dplyr statements to slice and dice our dataframe

The first functions that we will explore are rename() and select(), where we will rename some columns, and drop others (through omission) with the select function

Renaming some columns, dropping others…

Some of our columns have long or redundant names; let’s change that with rename(). At the same time we’ll drop the url column.

Rank	title	author	price	published	genre	rating
1	Iron Flame (The Empyrean, 2)	Rebecca Yarros	18.42	2023	Fantasy Romance	4.1
2	The Woman in Me	Britney Spears	20.93	2023	Memoir	4.5
3	My Name Is Barbra	Barbra Streisand	31.50	2023	Autobiography	4.5
4	Friends, Lovers, and the Big Terrible Thing: A Memoir	Matthew Perry	23.99	2023	Memoir	4.4
5	How to Catch a Turkey	Adam Wallace	5.65	2018	Childrens, Fiction	4.8

##       Rank           title              author              price       
##  Min.   :  1.00   Length:100         Length:100         Min.   : 2.780  
##  1st Qu.: 25.75   Class :character   Class :character   1st Qu.: 6.303  
##  Median : 50.50   Mode  :character   Mode  :character   Median :11.480  
##  Mean   : 50.50                                         Mean   :12.709  
##  3rd Qu.: 75.25                                         3rd Qu.:16.990  
##  Max.   :100.00                                         Max.   :48.770  
##                                                                         
##    published       genre               rating    
##  Min.   :1947   Length:100         Min.   :4.10  
##  1st Qu.:2014   Class :character   1st Qu.:4.60  
##  Median :2019   Mode  :character   Median :4.70  
##  Mean   :2014                      Mean   :4.69  
##  3rd Qu.:2023                      3rd Qu.:4.80  
##  Max.   :2024                      Max.   :5.00  
##                                    NA's   :3
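The renaming and dropping described in this section could be sketched as follows, on three sample rows. The new names and the column order are taken from the table above; the object names books_raw and books are illustrative.

```r
library(tidyverse)

books_raw <- tribble(
  ~Rank, ~book.title, ~book.price, ~rating, ~author, ~year.of.publication, ~genre, ~url,
  1, "Iron Flame (The Empyrean, 2)", 18.42, 4.1, "Rebecca Yarros", 2023, "Fantasy Romance",
  "amazon.com/Iron-Flame-Empyrean-Rebecca-Yarros/dp/1649374178/ref=zg_bs_g_books_sccl_1/143-9831347-1043253?psc=1",
  2, "The Woman in Me", 20.93, 4.5, "Britney Spears", 2023, "Memoir",
  "amazon.com/Woman-Me-Britney-Spears/dp/1668009048/ref=zg_bs_g_books_sccl_2/143-9831347-1043253?psc=1",
  3, "My Name Is Barbra", 31.50, 4.5, "Barbra Streisand", 2023, "Autobiography",
  "amazon.com/My-Name-Barbra-Streisand/dp/0525429522/ref=zg_bs_g_books_sccl_3/143-9831347-1043253?psc=1"
)

books <- books_raw %>%
  # rename() takes new_name = old_name pairs
  rename(title = book.title, price = book.price, published = year.of.publication) %>%
  # select() keeps only the listed columns, in this order; url is dropped by omission
  select(Rank, title, author, price, published, genre, rating)

names(books)
```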

Inquiring minds want to know…

Our dataset covers publication years 1947 through 2024, but we are curious about the top-100 books published in 2020 through 2023 (the Covid-19 pandemic years), and only those rated higher than 3.5. Because the lowest rating in the data is 4.1, the rating condition in practice only excludes the 3 books with a missing rating. Let’s see how we accomplish this…

filter() and arrange()

Rank	title	author	price	published	genre	rating
1	Iron Flame (The Empyrean, 2)	Rebecca Yarros	18.42	2023	Fantasy Romance	4.1
18	The Exchange: After The Firm (The Firm Series)	John Grisham	18.96	2023	Literary, Thrillers - Suspense	4.1
92	The Coworker	Freida McFadden	9.44	2023	Thriller, Mystery	4.2
7	Unwoke: How to Defeat Cultural Marxism in America	Unknown	27.43	2023	Nonfiction, Politics	4.3
64	Tom Lake: A Reese’s Book Club Pick	Ann Patchett	15.33	2023	Literary, Medical, Family Life - General, World Literature - India - General	4.3

It’s interesting to note that 47 of our original 100 observations were published during the Covid-19 years of 2020 - 2023.
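A sketch of the filter() and arrange() step on five sample rows. Sorting by ascending rating is an assumption based on the order of the table above; the object name covid_books is illustrative.

```r
library(tidyverse)

books <- tribble(
  ~Rank, ~title, ~author, ~price, ~published, ~genre, ~rating,
  1, "Iron Flame (The Empyrean, 2)", "Rebecca Yarros", 18.42, 2023, "Fantasy Romance", 4.1,
  2, "The Woman in Me", "Britney Spears", 20.93, 2023, "Memoir", 4.5,
  3, "My Name Is Barbra", "Barbra Streisand", 31.50, 2023, "Autobiography", 4.5,
  4, "Friends, Lovers, and the Big Terrible Thing: A Memoir", "Matthew Perry", 23.99, 2023, "Memoir", 4.4,
  5, "How to Catch a Turkey", "Adam Wallace", 5.65, 2018, "Childrens, Fiction", 4.8
)

covid_books <- books %>%
  # keep rows where both conditions are TRUE; rows with NA rating are dropped
  filter(between(published, 2020, 2023), rating > 3.5) %>%
  # sort by rating, lowest first; ties keep their original order
  arrange(rating)

covid_books
nrow(covid_books)   # number of books that pass the filter
```

On the full dataset, nrow() is the call that would yield the count quoted above.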

Grouping..

Let’s find the average, minimum and maximum price by publication year using group_by() and summarise().

## # A tibble: 4 x 4
##   published Avgprice Minprice Maxprice
##       <int>    <dbl>    <dbl>    <dbl>
## 1      2020     9.93     5.37     15.8
## 2      2021    21.1     10.4      48.8
## 3      2022    12.9      6.93     19.9
## 4      2023    17.5      4.78     31.5

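Among these bestsellers, the average price was $9.93 for books published in 2020 and $17.54 for books published in 2023. The increase was not steady, however: titles published in 2021 had the highest average (about $21.1) and those from 2022 averaged about $12.9.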
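The grouped summary above could be produced along these lines. The sketch uses five sample rows without the year filter, so it returns groups for 2018 and 2023 rather than the four pandemic years; the column names Avgprice, Minprice and Maxprice follow the output shown above.

```r
library(tidyverse)

books <- tribble(
  ~title, ~price, ~published,
  "Iron Flame (The Empyrean, 2)", 18.42, 2023,
  "The Woman in Me", 20.93, 2023,
  "My Name Is Barbra", 31.50, 2023,
  "Friends, Lovers, and the Big Terrible Thing: A Memoir", 23.99, 2023,
  "How to Catch a Turkey", 5.65, 2018
)

price_by_year <- books %>%
  group_by(published) %>%          # one group per publication year
  summarise(
    Avgprice = mean(price),        # one summary row per group
    Minprice = min(price),
    Maxprice = max(price)
  )

price_by_year
```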

Let’s use mutate to group our data into new columns based on user-defined conditions…
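The conditions used to derive the new column are not shown here, so the sketch below is only one possible rule: genres mentioning memoir, biography or nonfiction are labelled "nonfiction" and everything else "fiction". Both the rule and the column name genrerefined are assumptions.

```r
library(tidyverse)

books <- tribble(
  ~title, ~genre,
  "Iron Flame (The Empyrean, 2)", "Fantasy Romance",
  "The Woman in Me", "Memoir",
  "My Name Is Barbra", "Autobiography",
  "Friends, Lovers, and the Big Terrible Thing: A Memoir", "Memoir",
  "How to Catch a Turkey", "Childrens, Fiction"
)

# Hypothetical rule: "Autobiography" matches because it contains "biography"
nonfiction_pattern <- regex("memoir|biography|nonfiction", ignore_case = TRUE)

books <- books %>%
  mutate(
    genrerefined = if_else(str_detect(genre, nonfiction_pattern),
                           "nonfiction", "fiction")
  )

books
```

For more than two categories, case_when() would take the place of if_else().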

Using summarise, gather and arrange functions to group data

Frequency analysis of genrerefined

## # A tibble: 2 x 2
##   genre       freq
##   <chr>      <int>
## 1 fiction       63
## 2 nonfiction    37

Here we see that there are 63 fiction and 37 nonfiction books in our dataset. Let’s graph them using ggplot2.
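A frequency table like the one above can be sketched with group_by(), summarise() and arrange(). The five labels below are illustrative sample values, not the full dataset, and the column name genrerefined is an assumption.

```r
library(tidyverse)

# Five illustrative labels standing in for the refined genre column
refined <- tibble(
  genrerefined = c("fiction", "nonfiction", "nonfiction", "nonfiction", "fiction")
)

genre_freq <- refined %>%
  group_by(genre = genrerefined) %>%   # group and rename in one step
  summarise(freq = n()) %>%            # n() counts the rows in each group
  arrange(desc(freq))                  # most frequent first

genre_freq

# count() is a shortcut for the same group-and-tally pattern
genre_freq2 <- count(refined, genre = genrerefined, name = "freq", sort = TRUE)
```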

ggplot2

ggplot2 allows us to visualize our data in a variety of ways. Here we will use a simple barplot to illustrate our genre findings.

Here we can clearly see that fiction-like books outnumber nonfiction-like books 63 to 37, a ratio of roughly 1.7 to 1.
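The plot itself is not reproduced in this text. A bar chart of the two counts reported above could be sketched as follows; the title, axis labels and styling are illustrative choices, not necessarily those of the original figure.

```r
library(tidyverse)

# Counts taken from the frequency table above
genre_freq <- tibble(
  genre = c("fiction", "nonfiction"),
  freq  = c(63L, 37L)
)

# geom_col() draws bars whose heights are the values in freq
p <- ggplot(genre_freq, aes(x = genre, y = freq, fill = genre)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Amazon Top 100 Bestselling Books by genre",
       x = "Genre", y = "Number of books")

p

# Ratio of fiction to nonfiction
genre_freq$freq[1] / genre_freq$freq[2]
```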

Conclusion:

The tidyverse contains a myriad of tools that allow us to easily analyze data, find insights and see correlations. Other possibilities include linear regression (not illustrated here due to time constraints) and other statistical functions.