Data Wrangling (Data Preprocessing)

Setup

library(kableExtra)
library(tidyverse) #Includes tibble, tidyr, readr, dplyr, stringr
library(rlang) #Base types and Tidyverse features
library(knitr) #For computing & reporting
library(magrittr) 
library(gdata) #For data manipulation
library(tinytex)

Student names, numbers and percentage of contributions

Group information
Student name	Student number	Percentage of contribution
Lai Teng Wong	s3714421	100%

Data Description

The data set was taken from Kaggle: https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019, licensed under Creative Commons CC0 1.0: Public Domain. The data set was saved as “bestsellers with categories.csv” in my working directory (I did not rename the file from Kaggle).

This is a data set on Amazon’s bestselling books from 2009 to 2019. The data was scraped in October 2020 and categorized into fiction and non-fiction using Goodreads. The data set contains 550 rows and 7 columns: the first column holds the Name of each book and the other 6 columns are variables describing that book. The data set consists of character/categorical (qualitative) and numerical (quantitative) variables.

The variables in this data set are:

Name: The name of the book
Author: The author of the book
User Rating: User ratings of the book on Amazon
Reviews: Number of written user reviews for the book on Amazon
Price: The price of the book
Year: The year(s) each book was ranked on Amazon’s bestseller list
Genre: The Genre of the book: “Fiction” or “Non-Fiction”

Read/Import Data

setwd("C:/Users/laite/Desktop/Data Wrangling/Practical Assessment 1") #Set working directory
amazon <- read_csv("bestsellers with categories.csv") #Read csv file downloaded from Kaggle

## Rows: 550 Columns: 7

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): Name, Author, Genre
## dbl (4): User Rating, Reviews, Price, Year

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

is.data.frame(amazon) #Check if the initial csv file has been loaded as a data frame

## [1] TRUE

class(amazon) #Check the class of amazon data set

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

amazon_df <- as.data.frame(amazon) #Save amazon data set as a data frame
is.data.frame(amazon_df) #Check if amazon_df is a data frame

## [1] TRUE

class(amazon_df) #Check the class of the amazon_df

## [1] "data.frame"

amazon_df %>% head(3) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #View the first 3 rows and scale down the size of output

Name	Author	User Rating	Reviews	Price	Year	Genre
10-Day Green Smoothie Cleanse	JJ Smith	4.7	17350	8	2016	Non Fiction
11/22/63: A Novel	Stephen King	4.6	2052	22	2011	Fiction
12 Rules for Life: An Antidote to Chaos	Jordan B. Peterson	4.7	18979	15	2018	Non Fiction

First, I set my working directory and loaded the csv file downloaded from Kaggle using the read_csv() function from the readr package, assigning it to: amazon.

The columns “Name”, “Author” and “Genre” are of type “character” because those variables are strings, and the columns “User Rating”, “Reviews”, “Price” and “Year” are of type “double” because those variables contain numeric values. read_csv() reads “Reviews”, “Price” and “Year” as double precision floating point numbers instead of integers, even though the values in those variables are whole numbers, because no column types were specified when importing.

The data set amazon was initially loaded into R as a data frame: is.data.frame() returns TRUE. However, class(amazon) returns “spec_tbl_df” “tbl_df” “tbl” “data.frame”, which means that amazon is a tibble and a data frame at the same time. Normally I wouldn’t change anything, since a tibble is also a data frame, but this assignment specifically requires the data set to be saved as a data frame. I therefore used as.data.frame() to coerce the data set to a plain data frame and assigned it to a variable: amazon_df. class(amazon_df) now returns “data.frame”.
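The relationship between a tibble and a plain data frame described above can be checked on a small made-up table. This is a minimal sketch, assuming the tibble package is installed; the object names tb and plain_df and the values are hypothetical and not part of the assignment data.

```r
library(tibble)

# Hypothetical two-row table; the values are illustrative only
tb <- tibble(Name = c("Book A", "Book B"), Price = c(8, 22))

class(tb)                    # tibble classes followed by "data.frame"
inherits(tb, "data.frame")   # a tibble is also a data frame

# as.data.frame() drops the tibble classes and keeps the data
plain_df <- as.data.frame(tb)
class(plain_df)
```

The data are unchanged by the conversion; only the class attribute (and therefore printing and subsetting behaviour) differs.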

I used head(3) to view the first 3 rows of the data set and kableExtra package functions to scale down and condense the output of amazon_df.

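I did not use stringsAsFactors=TRUE here because it is an argument of read.csv(), not of read_csv(). read_csv() reads strings as character and does not automatically convert them to factors, so I convert the relevant columns to factors explicitly in the next section.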
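If integer columns were wanted at import time, the column types could be declared in the read_csv() call. The following is only a sketch of that option, not what was done in this report: it assumes readr 2.0 or later (for literal data wrapped in I()), and the three-row sample and the names csv_text, guessed and typed are hypothetical.

```r
library(readr)

# Hypothetical sample standing in for the Kaggle file
csv_text <- I("Name,Reviews,Price,Year\nBook A,17350,8,2016\nBook B,2052,22,2011\nBook C,18979,15,2018\n")

# Default guessing: whole numbers are parsed as double
guessed <- read_csv(csv_text, show_col_types = FALSE)
typeof(guessed$Reviews)

# Explicit column specification: request integer columns
typed <- read_csv(
  csv_text,
  col_types = cols(
    Name    = col_character(),
    Reviews = col_integer(),
    Price   = col_integer(),
    Year    = col_integer()
  )
)
typeof(typed$Reviews)
```

Supplying col_types also silences the column specification message shown in the output above.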

Inspect and Understand

any(is.na(amazon_df)) #Check for any missing values in our data frame

## [1] FALSE

dim(amazon_df) #Check the dimensions of the data frame.

## [1] 550   7

colnames(amazon_df) #Check the column names in the data frame

## [1] "Name"        "Author"      "User Rating" "Reviews"     "Price"      
## [6] "Year"        "Genre"

amazon_df <- amazon_df %>% rename(User_Rating = `User Rating`) #Rename "User Rating" to "User_Rating" 
colnames(amazon_df) #View the column names after the change

## [1] "Name"        "Author"      "User_Rating" "Reviews"     "Price"      
## [6] "Year"        "Genre"

str(amazon_df) #Check the data types of the variables in the data set

## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : chr  "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
##  $ Author     : chr  "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
##  $ User_Rating: num  4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
##  $ Reviews    : num  17350 2052 18979 21424 7665 ...
##  $ Price      : num  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : num  2016 2011 2018 2017 2019 ...
##  $ Genre      : chr  "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...

#Factorize the categorical variables: "Name", "Author", "Genre", they are non-ordered categorical variables 
amazon_df$Name <- amazon_df$Name %>% as.factor()
amazon_df$Author <- amazon_df$Author %>% as.factor()
amazon_df$Genre <- amazon_df$Genre %>% as.factor()

#Factorize "User_Rating" column, label "4" to "4.0" and ordered the levels
amazon_df$User_Rating <- factor(amazon_df$User_Rating, labels=c("3.3", "3.6", "3.8", "3.9", "4.0", "4.1", "4.2", "4.3", "4.4", "4.5", "4.6", "4.7", "4.8", "4.9"), ordered=TRUE)
head(amazon_df$User_Rating) #Check that User_Rating has been ordered properly

## [1] 4.7 4.6 4.7 4.7 4.8 4.4
## 14 Levels: 3.3 < 3.6 < 3.8 < 3.9 < 4.0 < 4.1 < 4.2 < 4.3 < 4.4 < ... < 4.9

class(amazon_df$User_Rating) #Check the class of User_Rating

## [1] "ordered" "factor"

str(amazon_df$User_Rating) #Check the structure of User_Rating

##  Ord.factor w/ 14 levels "3.3"<"3.6"<"3.8"<..: 12 11 12 12 13 9 12 12 12 11 ...

amazon_df$Year <- factor(amazon_df$Year, ordered=TRUE) #Factorize "Year" column, ordered the values within Year
head(amazon_df$Year) #Check that Year has been ordered properly

## [1] 2016 2011 2018 2017 2019 2011
## 11 Levels: 2009 < 2010 < 2011 < 2012 < 2013 < 2014 < 2015 < 2016 < ... < 2019

str(amazon_df) #Check the data types in our data frame again

## 'data.frame':    550 obs. of  7 variables:
##  $ Name       : Factor w/ 351 levels "10-Day Green Smoothie Cleanse",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Author     : Factor w/ 248 levels "Abraham Verghese",..: 125 220 135 96 175 97 97 13 115 90 ...
##  $ User_Rating: Ord.factor w/ 14 levels "3.3"<"3.6"<"3.8"<..: 12 11 12 12 13 9 12 12 12 11 ...
##  $ Reviews    : num  17350 2052 18979 21424 7665 ...
##  $ Price      : num  8 22 15 6 12 11 30 15 3 8 ...
##  $ Year       : Ord.factor w/ 11 levels "2009"<"2010"<..: 8 3 10 9 11 3 6 9 10 8 ...
##  $ Genre      : Factor w/ 2 levels "Fiction","Non Fiction": 2 1 2 1 2 1 1 1 2 1 ...

summary(amazon_df) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #Run the summary function

Name	Author	User_Rating	Reviews	Price	Year	Genre
Publication Manual of the American Psychological Association, 6th Edition : 10	Jeff Kinney : 12	4.8 :127	Min. : 37	Min. : 0.0	2009 : 50	Fiction :240
StrengthsFinder 2.0 : 9	Gary Chapman : 11	4.7 :108	1st Qu.: 4058	1st Qu.: 7.0	2010 : 50	Non Fiction:310
Oh, the Places You’ll Go! : 8	Rick Riordan : 11	4.6 :105	Median : 8580	Median : 11.0	2011 : 50	NA
The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change: 7	Suzanne Collins : 11	4.5 : 60	Mean :11953	Mean : 13.1	2012 : 50	NA
The Very Hungry Caterpillar : 7	American Psychological Association: 10	4.9 : 52	3rd Qu.:17253	3rd Qu.: 16.0	2013 : 50	NA
Jesus Calling: Enjoying Peace in His Presence (with Scripture References) : 6	Dr. Seuss : 9	4.4 : 38	Max. :87841	Max. :105.0	2014 : 50	NA
(Other) :503	(Other) :486	(Other): 60	NA	NA	(Other):250	NA

#It seems like some books are repeated under "Name"
length(unique(amazon_df$Name)) #Check for the number of unique values within variable "Name"

## [1] 351

sub1 <- subset(amazon_df, duplicated(amazon_df$Name)) #Subset duplicated "Names" values from amazon_df 
sub1 %>% tail(3) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #View the last 3 rows

	Name	Author	User_Rating	Reviews	Price	Year	Genre
548	You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life	Jen Sincero	4.7	14331	8	2017	Non Fiction
549	You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life	Jen Sincero	4.7	14331	8	2018	Non Fiction
550	You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life	Jen Sincero	4.7	14331	8	2019	Non Fiction

Firstly, any(is.na(amazon_df)) returns FALSE, which means there are no missing values in the data set to deal with. dim(amazon_df) returns 550 rows and 7 columns (column headers and row indexes are not counted). The column names of the original data set seemed fine, but I renamed “User Rating” to “User_Rating” so that the variable can be referenced later without backticks. I checked the data types of each column using the str() function: “Name”, “Author” and “Genre” are “character”, while “User_Rating”, “Reviews”, “Price” and “Year” are “numeric”.

I factorized all the string variables in the data frame: “Name”, “Author” and “Genre”. They are non-ordered categorical variables, so after factorizing they are non-ordered factors.

“User_Rating” was read as “numeric” in R, but I treat it as a categorical (ordinal) variable. I factorized “User_Rating”, relabelled the value “4” as “4.0”, and ordered the levels. I did not state the levels explicitly, because factor() sorts the unique values in ascending order by default, from the lowest rating “3.3” to the highest rating “4.9”. The 14 labels I supplied are matched to those sorted values by position, so this relies on the data containing exactly these 14 distinct ratings. I checked that “User_Rating” has been factorized and ordered correctly using the head(), class() and str() functions. It is now an ordered factor with 14 levels “3.3”<“3.6”<…<“4.9”.
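Because labels supplied without levels are matched purely by position, a variant that derives both from the data may be more robust. This is a minimal sketch on a made-up ratings vector, not the code used in this report; the names ratings, lv and rating_f are hypothetical.

```r
# Hypothetical ratings; 4 should be displayed as "4.0"
ratings <- c(4.7, 4, 4.6, 4.7, 3.3)

# Take the sorted unique values as levels and format them with one decimal
lv <- sort(unique(ratings))
rating_f <- factor(
  ratings,
  levels  = lv,
  labels  = formatC(lv, format = "f", digits = 1),
  ordered = TRUE
)

levels(rating_f)            # labels in ascending order of rating
rating_f[1] > rating_f[2]   # comparisons work on ordered factors
```

With this approach the number of labels always equals the number of distinct ratings, so the call does not need editing if the data change.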

“Year” was also read as “numeric” in R. In this case, “Year” should be considered a categorical variable because it identifies the bestseller list (2009 to 2019) on which each book appeared. It does not make sense to take the mean or median of “Year”, and the ratio between two years is not meaningful either, so “Year” is more appropriately classified as an ordered factor. I factorized “Year” and ordered the levels from 2009 to 2019: “2009”<“2010”<…<“2019”.

“Price” and “Reviews” are quantitative variables, so I left them as “numeric”.

I used str(amazon_df) to check that I have factorized and/or ordered the variables mentioned above correctly.

I ran the summary() function to check the frequencies of the categorical variables and the statistical summaries of the numeric variables. It turns out that some books (“Name”) are repeated. I checked the number of unique values in “Name”: the output is 351, which means 550-351=199 rows repeat a book name that has already appeared. To check whether these books have duplicate values across all the columns, I used the subset() function with duplicated() to extract the rows whose “Name” had already occurred. Here I show just one example, the book “You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life”, which has 3 rows in this subset with different values under the “Year” column: 2017, 2018 and 2019. Because duplicated() flags only the second and later occurrences of a value, the book’s first row in amazon_df is not part of this subset. The book was therefore ranked on Amazon’s list in several consecutive years. Given that the values in “Price”, “Reviews” and “User_Rating” are the same across those years, it could mean that the data set retained the values from the latest year in which the book was ranked. Since these rows are not true duplicates, I did not remove them from amazon_df.
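The difference between a repeated name and a fully duplicated row can be illustrated on a small made-up table. This is a sketch only; the data frame books and its values are hypothetical.

```r
# Hypothetical table: "A" appears in three years, "C" has one exact duplicate row
books <- data.frame(
  Name  = c("A", "A", "A", "B", "C", "C"),
  Year  = c(2016, 2017, 2018, 2016, 2017, 2017),
  Price = c(8, 8, 8, 12, 5, 5)
)

# duplicated() flags the second and later occurrences only
name_repeat <- duplicated(books$Name)
sum(name_repeat)                              # number of repeat rows
nrow(books) - length(unique(books$Name))      # same count, as computed in the text

# Applied to the whole data frame, it flags rows identical in every column
full_repeat <- duplicated(books)
sum(full_repeat)

# Frequency of each name, including the first occurrence
table(books$Name)
```

Running duplicated() on the whole data frame is a direct way to confirm whether any row is repeated across all columns.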

Subsetting

amazon_df_sub <- amazon_df[1:10, ] #Subset the data frame using the first 10 observations, including all variables
amazon_df_sub %>% tail(3) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #View the last 3 rows

	Name	Author	User_Rating	Reviews	Price	Year	Genre
8	A Gentleman in Moscow: A Novel	Amor Towles	4.7	19699	15	2017	Fiction
9	A Higher Loyalty: Truth, Lies, and Leadership	James Comey	4.7	5983	3	2018	Non Fiction
10	A Man Called Ove: A Novel	Fredrik Backman	4.6	23848	8	2016	Fiction

amazon_matrix <- as.matrix(amazon_df_sub) #Convert the data frame above to a matrix
is.matrix(amazon_matrix) #Check if it is a matrix

## [1] TRUE

class(amazon_matrix) #Check the class of the matrix

## [1] "matrix" "array"

str(amazon_matrix) #Check the structure of the matrix

##  chr [1:10, 1:7] "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:10] "1" "2" "3" "4" ...
##   ..$ : chr [1:7] "Name" "Author" "User_Rating" "Reviews" ...

typeof(amazon_matrix) #Check the type of the matrix

## [1] "character"

I subset amazon_df to the first 10 rows using bracket indexing, amazon_df[1:10, ], and assigned this subset to: amazon_df_sub. Leaving the column position empty keeps all variables. tail(3) shows the last 3 rows of the subset, which are rows 8, 9 and 10. This confirms that the subset has 10 rows and all 7 variables (columns).
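The same row selection can be written in more than one way in base R. The sketch below uses a made-up 15-row data frame; the names scores, first_ten and same_rows are hypothetical.

```r
# Hypothetical data frame with 15 rows
scores <- data.frame(id = 1:15, value = seq(10, 150, by = 10))

# Bracket indexing: rows 1 to 10, all columns
first_ten <- scores[1:10, ]

# head() with n = 10 selects the same rows
same_rows <- head(scores, 10)

identical(first_ten, same_rows)
dim(first_ten)            # rows and columns of the subset
tail(first_ten, 3)$id     # ids of the last three rows of the subset
```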

Then I converted the subset into a matrix using the as.matrix() function, and assigned this matrix to: amazon_matrix. I used the is.matrix() function to check that the result is a matrix, and the output returns TRUE. The class() function returns “matrix” “array”, which means amazon_matrix is a matrix, that is, a two-dimensional array.

I ran the str() function to check the structure of the matrix and typeof() to check its type. The output returns chr, which means the matrix is now a “character” matrix. This is because all elements of a matrix must be of the same type. Since amazon_df_sub contains both “numeric” and “factor” columns, as.matrix() coerced every element to the most flexible type needed to represent them all. In this case, “character” is more flexible than “numeric”, so the type of the matrix is “character”. The columns must also be of the same length to form a rectangular matrix, but this is not an issue here because a data frame is already a list of equal-length vectors, so the dimensions (10 rows, 7 columns) are unchanged.
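The coercion rule can be seen on a two-column example. This is a minimal sketch with made-up data; small, m_mixed, m_numeric and m_codes are hypothetical names.

```r
# Hypothetical data frame with one factor and one numeric column
small <- data.frame(
  Genre = factor(c("Fiction", "Non Fiction")),
  Price = c(8, 22)
)

# Mixed column types: everything becomes character
m_mixed <- as.matrix(small)
typeof(m_mixed)

# Numeric columns only: the matrix stays numeric
m_numeric <- as.matrix(small["Price"])
typeof(m_numeric)

# data.matrix() is an alternative that replaces factors by their integer codes
m_codes <- data.matrix(small)
typeof(m_codes)
```

If a numeric matrix is needed, selecting only the numeric columns before calling as.matrix() avoids the conversion to character.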

Create a new Data Frame

df <- data.frame(matrix(0, ncol = 2, nrow = 10)) #Create a placeholder data frame with 10 rows and 2 columns, filled with zeros

set.seed(1) #Generate an integer variable: Age 
Age <- sample(20L:35L, 10, replace=TRUE) 

set.seed(1) #Generate an ordinal variable: Final_Assignment_Grade & order the variable 
Final_Assignment_Grade <- factor(sample(c("0-49","50-59","60-69","70-79","80-100"),10, replace=TRUE), levels=c("0-49","50-59","60-69","70-79","80-100"), labels=c("Fail","Pass","Credit","Distinction","High Distinction"), ordered=TRUE) 

df <- data.frame(Age,Final_Assignment_Grade) #Place two of the variables I have created into my dataframe: df 
df$Age #View 10 observations in Age

##  [1] 28 23 26 20 21 32 26 30 33 21

df$Final_Assignment_Grade #View the 10 observations in Final_Assignment_Grade and ordered levels

##  [1] Fail             Distinction      Fail             Pass            
##  [5] High Distinction Credit           Pass             Credit          
##  [9] Credit           Fail            
## Levels: Fail < Pass < Credit < Distinction < High Distinction

set.seed(1) #Generate a numeric variable: Final_Exam_GPA 
gpa <- seq(from=0, to=4.0, by=.1)
Final_Exam_GPA <- sample(gpa, size=10, replace=TRUE) 

df <- cbind(df,Final_Exam_GPA) #Bind the numeric variable: Final_Exam_GPA to my df using cbind()
df$Final_Exam_GPA #View the 10 observations in Final_Exam_GPA

##  [1] 0.3 3.8 0.0 3.3 2.2 1.3 1.7 3.2 2.0 2.0

rownames(df) <- c("Student 1","Student 2","Student 3","Student 4","Student 5","Student 6","Student 7","Student 8","Student 9","Student 10") #Rename the row indexes of df

df %>% tail(3) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #View the last 3 rows of the df with all 3 variables

	Age	Final_Assignment_Grade	Final_Exam_GPA
Student 8	30	Credit	3.2
Student 9	33	Credit	2.0
Student 10	21	Fail	2.0

str(df) #Check the final structure of all the variables in df

## 'data.frame':    10 obs. of  3 variables:
##  $ Age                   : int  28 23 26 20 21 32 26 30 33 21
##  $ Final_Assignment_Grade: Ord.factor w/ 5 levels "Fail"<"Pass"<..: 1 4 1 2 5 3 2 3 3 1
##  $ Final_Exam_GPA        : num  0.3 3.8 0 3.3 2.2 1.3 1.7 3.2 2 2

I started by creating a placeholder data frame, df, with 10 rows and 2 columns filled with zeros (this step is not necessary, because df is overwritten later by data.frame(Age, Final_Assignment_Grade)). The data frame I am building has 10 observations, one for each of 10 individual students.

Note: Instead of setting the values for each variable myself, I am asking R to generate 10 random values for all the variables in df.

For all the variables described below, I used the set.seed() function so that the random values generated can be reproduced exactly when the code is run again.
I have set replace=TRUE to allow R to repeat values within each random sample generated.
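The effect of set.seed() can be verified directly: resetting the seed before an identical sample() call reproduces the same values. This is a small sketch; draw_a and draw_b are hypothetical names.

```r
# Same seed, same call: the two samples are identical
set.seed(1)
draw_a <- sample(20L:35L, 10, replace = TRUE)

set.seed(1)
draw_b <- sample(20L:35L, 10, replace = TRUE)

identical(draw_a, draw_b)
class(draw_a)   # sampling from an integer range keeps the integer class
```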

Firstly, I created an integer variable: Age, which stands for each student’s age. Here I have set the range of Age to be 20 to 35 (a typical age range for university students). I added “L” after the numbers to state explicitly that I want the variable to be an integer, even though “20:35” would also give an integer vector.

Secondly, I created an ordinal variable: Final_Assignment_Grade, which stands for each student’s final assignment grade. I factorized the variable using factor(). The levels of this variable consist of “0-49”, “50-59”, “60-69”, “70-79” and “80-100”, which are the mark ranges for the students’ final assignment. I labelled these mark ranges as grades so that they are easier to interpret: “Fail” for “0-49”, “Pass” for “50-59”, “Credit” for “60-69”, “Distinction” for “70-79” and “High Distinction” for “80-100”. I have also ordered the grades: “Fail”<“Pass”<“Credit”<“Distinction”<“High Distinction”. Final_Assignment_Grade is an ordered factor with 5 levels, from “Fail”, the lowest grade, to “High Distinction”, the highest grade.

Then, I put both variables “Age” and “Final_Assignment_Grade” into df using the data.frame() function.

Thirdly, I created a numerical variable: Final_Exam_GPA, which stands for each student’s final exam GPA. I set the random values to range from 0 to 4.0 in increments of 0.1 (which is typically the case for GPA). It is a numeric variable rather than an integer variable because GPA values have one decimal place and are not whole numbers.

Lastly, I used the cbind() function to add the numerical variable Final_Exam_GPA to my data frame: df. I renamed the row names of df to “Student 1” through “Student 10”. str(df) shows that df now consists of 10 observations and 3 variables: Age (integer), Final_Assignment_Grade (ordered factor) and Final_Exam_GPA (numeric).
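For comparison, the same kind of data frame can be assembled in a single data.frame() call. This is an alternative sketch, not the code used above: it sets the seed once, so the three variables are drawn one after another from the random stream instead of each restarting from the same generator state, and the generated values will therefore differ from the output shown earlier. The names students, n and grade_labels are hypothetical.

```r
set.seed(1)  # one seed for the whole construction
n <- 10
grade_labels <- c("Fail", "Pass", "Credit", "Distinction", "High Distinction")

students <- data.frame(
  # Integer variable: ages between 20 and 35
  Age = sample(20L:35L, n, replace = TRUE),
  # Ordinal variable: grades ordered from lowest to highest
  Final_Assignment_Grade = factor(
    sample(grade_labels, n, replace = TRUE),
    levels  = grade_labels,
    ordered = TRUE
  ),
  # Numeric variable: GPA from 0 to 4 in steps of 0.1
  Final_Exam_GPA = sample(seq(0, 4, by = 0.1), n, replace = TRUE),
  row.names = paste("Student", seq_len(n))
)

str(students)
```

Here the grade labels are sampled directly, which removes the separate relabelling step from mark ranges to grade names.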

Reference List

Saalu, S. (2020). Amazon Top 50 Bestselling Books 2009 - 2019. https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019

Taheri, S. (2021). Module 2 Get: Importing, Scraping and Exporting Data with R [Module Webpage]. Canvas @ RMIT University. http://rare-phoenix-161610.appspot.com/secured/Module_02.html

Taheri, S. (2021). Module 3 Understand: Understanding Data and Data Structures [Module Webpage]. Canvas @ RMIT University. http://rare-phoenix-161610.appspot.com/secured/Module_03.html