library(kableExtra)
library(tidyverse) #Includes tibble, tidyr, readr, dplyr, stringr
library(rlang) #Base types and Tidyverse features
library(knitr) #For computing & reporting
library(magrittr)
library(gdata) #For data manipulation
library(tinytex)
| Student name | Student number | Percentage of contribution |
|---|---|---|
| Lai Teng Wong | s3714421 | 100% |
The data set was taken from Kaggle: https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019, licensed by Creative Commons CC0 1.0: Public Domain. The data set was saved as “bestsellers with categories.csv” in my working directory (I did not rename the file from Kaggle).
This data set covers Amazon’s bestselling books from 2009 to 2019. The data was scraped in October 2020 and categorized into fiction and non-fiction using Goodreads. The data set contains 550 rows and 7 columns: the first column holds the Name of each book and the other 6 columns are variables related to that book. The data set consists of character/categorical (qualitative) and numerical (quantitative) variables.
The variables in this data set are: Name, Author, User Rating, Reviews, Price, Year and Genre.
setwd("C:/Users/laite/Desktop/Data Wrangling/Practical Assessment 1") #Set working directory
amazon <- read_csv("bestsellers with categories.csv") #Read csv file downloaded from Kaggle
## Rows: 550 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): Name, Author, Genre
## dbl (4): User Rating, Reviews, Price, Year
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
is.data.frame(amazon) #Check if the initial csv file has been loaded as a data frame
## [1] TRUE
class(amazon) #Check the class of amazon data set
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
amazon_df <- as.data.frame(amazon) #Save amazon data set as a data frame
is.data.frame(amazon_df) #Check if amazon_df is a data frame
## [1] TRUE
class(amazon_df) #Check the class of the amazon_df
## [1] "data.frame"
amazon_df %>% head(3) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #View the first 3 rows and scale down the size of output
| Name | Author | User Rating | Reviews | Price | Year | Genre |
|---|---|---|---|---|---|---|
| 10-Day Green Smoothie Cleanse | JJ Smith | 4.7 | 17350 | 8 | 2016 | Non Fiction |
| 11/22/63: A Novel | Stephen King | 4.6 | 2052 | 22 | 2011 | Fiction |
| 12 Rules for Life: An Antidote to Chaos | Jordan B. Peterson | 4.7 | 18979 | 15 | 2018 | Non Fiction |
First, I set my working directory and loaded the csv file downloaded from Kaggle using the read_csv() function from the readr package, assigning it to: amazon.
The type of columns “Name”, “Author” and “Genre” is “character” because those variables are strings, and the type of columns “User Rating”, “Reviews”, “Price” and “Year” is “double” because those variables contain numeric values. R is reading “Reviews”, “Price” and “Year” as double precision floating point numbers instead of integers, even though the values in those variables are whole numbers.
The data set amazon was initially loaded into R as a data frame: is.data.frame() returns TRUE. However, class(amazon) returns “spec_tbl_df” “tbl_df” “tbl” “data.frame”, which means that amazon is a tibble and a data frame at the same time. Normally I wouldn’t change anything, since a tibble is also a data frame, but this assignment specifically requires the data set to be saved as a data frame, so I used as.data.frame() to coerce it and assigned the result to a variable: amazon_df. class(amazon_df) now returns “data.frame”.
I used head(3) to view the first 3 rows of the data set and kableExtra package functions to scale down and condense the output of amazon_df.
I did not use stringsAsFactors=TRUE here because I used the read_csv() function instead of read.csv(). read_csv() has no such argument and does not convert strings to factors automatically, so the string columns were read as “character”.
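For comparison, here is a minimal sketch of how base R's read.csv() treats the same kind of file. The three rows are copied from the table above; the temporary file and the names `sample_rows`, `tmp`, `as_chr` and `as_fct` are illustrative assumptions and are not part of the analysis.

```r
# Three rows copied from the data set, written to a temporary CSV (illustrative only)
sample_rows <- data.frame(
  Name = c("10-Day Green Smoothie Cleanse", "11/22/63: A Novel",
           "12 Rules for Life: An Antidote to Chaos"),
  Author = c("JJ Smith", "Stephen King", "Jordan B. Peterson"),
  `User Rating` = c(4.7, 4.6, 4.7),
  Reviews = c(17350, 2052, 18979),
  Genre = c("Non Fiction", "Fiction", "Non Fiction"),
  check.names = FALSE   # keep the space in "User Rating"
)
tmp <- tempfile(fileext = ".csv")
write.csv(sample_rows, tmp, row.names = FALSE)

# Read the same file twice, with and without automatic factor conversion
as_chr <- read.csv(tmp, stringsAsFactors = FALSE, check.names = FALSE)
as_fct <- read.csv(tmp, stringsAsFactors = TRUE, check.names = FALSE)

class(as_chr$Genre)    # "character"
class(as_fct$Genre)    # "factor"
class(as_chr$Reviews)  # "integer": read.csv() stores whole numbers as integers
```

With stringsAsFactors=TRUE every string column becomes a factor, including “Name”, which is why converting selected columns by hand gives more control.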
any(is.na(amazon_df)) #Check for any missing values in our data frame
## [1] FALSE
dim(amazon_df) #Check the dimensions of the data frame.
## [1] 550 7
colnames(amazon_df) #Check the column names in the data frame
## [1] "Name" "Author" "User Rating" "Reviews" "Price"
## [6] "Year" "Genre"
amazon_df <- amazon_df %>% rename(User_Rating = `User Rating`) #Rename "User Rating" to "User_Rating"
colnames(amazon_df) #View the column names after the change
## [1] "Name" "Author" "User_Rating" "Reviews" "Price"
## [6] "Year" "Genre"
str(amazon_df) #Check the data types of the variables in the data set
## 'data.frame': 550 obs. of 7 variables:
## $ Name : chr "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" "12 Rules for Life: An Antidote to Chaos" "1984 (Signet Classics)" ...
## $ Author : chr "JJ Smith" "Stephen King" "Jordan B. Peterson" "George Orwell" ...
## $ User_Rating: num 4.7 4.6 4.7 4.7 4.8 4.4 4.7 4.7 4.7 4.6 ...
## $ Reviews : num 17350 2052 18979 21424 7665 ...
## $ Price : num 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : num 2016 2011 2018 2017 2019 ...
## $ Genre : chr "Non Fiction" "Fiction" "Non Fiction" "Fiction" ...
#Factorize the categorical variables: "Name", "Author", "Genre", they are non-ordered categorical variables
amazon_df$Name <- amazon_df$Name %>% as.factor()
amazon_df$Author <- amazon_df$Author %>% as.factor()
amazon_df$Genre <- amazon_df$Genre %>% as.factor()
#Factorize "User_Rating" column, label "4" to "4.0" and ordered the levels
amazon_df$User_Rating <- factor(amazon_df$User_Rating, labels=c("3.3", "3.6", "3.8", "3.9", "4.0", "4.1", "4.2", "4.3", "4.4", "4.5", "4.6", "4.7", "4.8", "4.9"), ordered=TRUE)
head(amazon_df$User_Rating) #Check that User_Rating has been ordered properly
## [1] 4.7 4.6 4.7 4.7 4.8 4.4
## 14 Levels: 3.3 < 3.6 < 3.8 < 3.9 < 4.0 < 4.1 < 4.2 < 4.3 < 4.4 < ... < 4.9
class(amazon_df$User_Rating) #Check the class of User_Rating
## [1] "ordered" "factor"
str(amazon_df$User_Rating) #Check the structure of User_Rating
## Ord.factor w/ 14 levels "3.3"<"3.6"<"3.8"<..: 12 11 12 12 13 9 12 12 12 11 ...
amazon_df$Year <- factor(amazon_df$Year, ordered=TRUE) #Factorize "Year" column, ordered the values within Year
head(amazon_df$Year) #Check that Year has been ordered properly
## [1] 2016 2011 2018 2017 2019 2011
## 11 Levels: 2009 < 2010 < 2011 < 2012 < 2013 < 2014 < 2015 < 2016 < ... < 2019
str(amazon_df) #Check the data types in our data frame again
## 'data.frame': 550 obs. of 7 variables:
## $ Name : Factor w/ 351 levels "10-Day Green Smoothie Cleanse",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Author : Factor w/ 248 levels "Abraham Verghese",..: 125 220 135 96 175 97 97 13 115 90 ...
## $ User_Rating: Ord.factor w/ 14 levels "3.3"<"3.6"<"3.8"<..: 12 11 12 12 13 9 12 12 12 11 ...
## $ Reviews : num 17350 2052 18979 21424 7665 ...
## $ Price : num 8 22 15 6 12 11 30 15 3 8 ...
## $ Year : Ord.factor w/ 11 levels "2009"<"2010"<..: 8 3 10 9 11 3 6 9 10 8 ...
## $ Genre : Factor w/ 2 levels "Fiction","Non Fiction": 2 1 2 1 2 1 1 1 2 1 ...
summary(amazon_df) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #Run the summary function
| Name | Author | User_Rating | Reviews | Price | Year | Genre | |
|---|---|---|---|---|---|---|---|
| Publication Manual of the American Psychological Association, 6th Edition : 10 | Jeff Kinney : 12 | 4.8 :127 | Min. : 37 | Min. : 0.0 | 2009 : 50 | Fiction :240 | |
| StrengthsFinder 2.0 : 9 | Gary Chapman : 11 | 4.7 :108 | 1st Qu.: 4058 | 1st Qu.: 7.0 | 2010 : 50 | Non Fiction:310 | |
| Oh, the Places You’ll Go! : 8 | Rick Riordan : 11 | 4.6 :105 | Median : 8580 | Median : 11.0 | 2011 : 50 | NA | |
| The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change: 7 | Suzanne Collins : 11 | 4.5 : 60 | Mean :11953 | Mean : 13.1 | 2012 : 50 | NA | |
| The Very Hungry Caterpillar : 7 | American Psychological Association: 10 | 4.9 : 52 | 3rd Qu.:17253 | 3rd Qu.: 16.0 | 2013 : 50 | NA | |
| Jesus Calling: Enjoying Peace in His Presence (with Scripture References) : 6 | Dr. Seuss : 9 | 4.4 : 38 | Max. :87841 | Max. :105.0 | 2014 : 50 | NA | |
| (Other) :503 | (Other) :486 | (Other): 60 | NA | NA | (Other):250 | NA | |
#Some book names appear more than once under "Name"
length(unique(amazon_df$Name)) #Check for the number of unique values within variable "Name"
## [1] 351
sub1 <- subset(amazon_df, duplicated(amazon_df$Name)) #Subset duplicated "Names" values from amazon_df
sub1 %>% tail(3) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #View the last 3 rows
| | Name | Author | User_Rating | Reviews | Price | Year | Genre |
|---|---|---|---|---|---|---|---|
| 548 | You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life | Jen Sincero | 4.7 | 14331 | 8 | 2017 | Non Fiction |
| 549 | You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life | Jen Sincero | 4.7 | 14331 | 8 | 2018 | Non Fiction |
| 550 | You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life | Jen Sincero | 4.7 | 14331 | 8 | 2019 | Non Fiction |
Firstly, any(is.na(amazon_df)) returns FALSE, which means there are no missing values in the data set to deal with. dim(amazon_df) returns 550 rows and 7 columns (column headers and row indexes are not counted). The column names of the original data set seemed fine; I only renamed “User Rating” to “User_Rating” to make the variable easier to refer to later. I checked the data type of each column using the str() function: “Name”, “Author” and “Genre” are “character”, while “User_Rating”, “Reviews”, “Price” and “Year” are “numeric”.
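As a small illustration of the missing-value check, the sketch below uses a toy data frame (`toy` and its values are made up for the example, not taken from the data set) to show how the overall check can be broken down by column when it does return TRUE.

```r
# Toy data frame with two deliberately missing values (hypothetical values)
toy <- data.frame(
  Price = c(8, NA, 15),
  Genre = c("Fiction", "Fiction", NA)
)

any(is.na(toy))            # TRUE: at least one value is missing
colSums(is.na(toy))        # number of missing values per column
sum(complete.cases(toy))   # number of rows without any missing value
```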
I factorized all the string columns in the data frame: “Name”, “Author” and “Genre”. They are non-ordered categorical variables, so after factorizing they are non-ordered factors.
“User_Rating” was read as “numeric” in R, but in fact it should be a categorical (ordinal) variable. I factorized “User_Rating”, relabelled the value “4” as “4.0”, and ordered the levels. Given that there are no missing values in my data set, I did not state the levels explicitly: R sorts the unique values in ascending order, from the lowest rating “3.3” to the highest rating “4.9”, and the labels I supplied are matched to those sorted values by position. I checked whether “User_Rating” has been factorized and ordered correctly using the head(), class() and str() functions. It is now an ordered factor with 14 levels “3.3”<“3.6”<….<“4.9”.
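Because hand-typed labels are matched to the sorted values purely by position, a mistyped or missing label would silently shift the ratings. A possible alternative, sketched here on a few toy values rather than the full column (the names `ratings`, `rating_levels` and `rating_fct` are assumptions for illustration), is to derive the labels from the values themselves:

```r
# Toy ratings, including the whole number 4 that should be displayed as "4.0"
ratings <- c(4.7, 4, 3.3, 4.9, 4)

# Build levels from the data, then format each level with one decimal place
rating_levels <- sort(unique(ratings))
rating_fct <- factor(ratings,
                     levels = rating_levels,
                     labels = sprintf("%.1f", rating_levels),
                     ordered = TRUE)

levels(rating_fct)             # "3.3" "4.0" "4.7" "4.9"
rating_fct[2] < rating_fct[1]  # TRUE: 4.0 is ranked below 4.7
```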
“Year” was also read as “numeric” in R. In this case, “Year” should be considered a categorical variable because it identifies the year, from 2009 to 2019, in which each book was ranked (it doesn’t make sense to take the mean/median of “Year”, and the ratio between two years is also not meaningful). “Year” is therefore more appropriately classified as an ordered factor. I factorized “Year” and ordered the levels from 2009 to 2019: “2009”<“2010”<…<“2019”.
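A short sketch of what an ordered factor allows, using the first three “Year” values shown earlier (the variable name `year` is an assumption for illustration): comparisons between levels work, and the original numbers can be recovered through the labels rather than the internal codes.

```r
# First three Year values from the data set, with all 11 levels declared
year <- factor(c(2016, 2011, 2018), levels = 2009:2019, ordered = TRUE)

year[1] > year[2]               # TRUE: 2016 comes after 2011
max(year)                       # 2018, the latest level present
as.integer(year)                # 8 3 10: internal level codes, not years
as.integer(as.character(year))  # 2016 2011 2018: the original values
```

The internal codes 8, 3 and 10 match the str() output above, which is why as.integer() alone should not be used to turn a factor back into years.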
“Price” and “Reviews” are quantitative variables, so I left them as “numeric”.
I used str(amazon_df) to check that I have factorized and/or ordered the variables mentioned above correctly.
I ran the summary() function to check the frequencies of the categorical variables and the statistical summaries of the numeric variables. It turns out that some book names appear more than once. I checked the number of unique values in “Name”: the output is 351, which means 550-351=199 rows repeat a name that already appears earlier in the data. To check whether these rows are duplicated across all columns, I used the subset() function with duplicated() to extract the rows whose “Name” has already occurred (duplicated() flags only the second and later occurrences, so the first occurrence of each book is not in sub1). As an example, the last three rows of sub1 show the book “You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life” with different values under the “Year” column, which means that the book was ranked on Amazon in the consecutive years 2017, 2018 and 2019. Given that the values in “Price”, “Reviews” and “User_Rating” are the same across those years, the data set may have retained the values from the latest year in which the book was ranked on Amazon. Since these rows are not actually “duplicates”, I did not remove them from amazon_df.
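The behaviour of duplicated() can be sketched on a toy data frame (the books and years in `books` are hypothetical): it marks only the repeats of a name, while checking whole rows shows that no row is an exact copy of another.

```r
# Hypothetical example: one book ranked in three years, another ranked once
books <- data.frame(
  Name = c("Book A", "Book A", "Book A", "Book B"),
  Year = c(2017, 2018, 2019, 2019)
)

duplicated(books$Name)       # FALSE TRUE TRUE FALSE: first occurrence not flagged
sum(duplicated(books$Name))  # 2 repeated rows, i.e. 4 rows minus 2 unique names

# To see every row of a repeated book, including its first occurrence
repeated_names <- unique(books$Name[duplicated(books$Name)])
all_rows <- books[books$Name %in% repeated_names, ]
nrow(all_rows)               # 3

any(duplicated(books))       # FALSE: no row is identical across all columns
```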
amazon_df_sub <- amazon_df[1:10, ] #Subset the data frame using the first 10 observations, including all variables
amazon_df_sub %>% tail(3) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #View the last 3 rows
| | Name | Author | User_Rating | Reviews | Price | Year | Genre |
|---|---|---|---|---|---|---|---|
| 8 | A Gentleman in Moscow: A Novel | Amor Towles | 4.7 | 19699 | 15 | 2017 | Fiction |
| 9 | A Higher Loyalty: Truth, Lies, and Leadership | James Comey | 4.7 | 5983 | 3 | 2018 | Non Fiction |
| 10 | A Man Called Ove: A Novel | Fredrik Backman | 4.6 | 23848 | 8 | 2016 | Fiction |
amazon_matrix <- as.matrix(amazon_df_sub) #Convert the data frame above to a matrix
is.matrix(amazon_matrix) #Check if it is a matrix
## [1] TRUE
class(amazon_matrix) #Check the class of the matrix
## [1] "matrix" "array"
str(amazon_matrix) #Check the structure of the matrix
## chr [1:10, 1:7] "10-Day Green Smoothie Cleanse" "11/22/63: A Novel" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:10] "1" "2" "3" "4" ...
## ..$ : chr [1:7] "Name" "Author" "User_Rating" "Reviews" ...
typeof(amazon_matrix) #Check the type of the matrix
## [1] "character"
I subset amazon_df to the first 10 rows using bracket indexing, amazon_df[1:10, ], and assigned this subset to: amazon_df_sub. tail(3) shows the last 3 rows of the subset, which are rows 8, 9 and 10. This confirms that the subset has 10 rows and all 7 variables (columns).
Then I converted the subset into a matrix using the as.matrix() function, and assigned this matrix to: amazon_matrix. I used the is.matrix() function to check that the subset is now a matrix; the output is TRUE. Running the class() function returns “matrix”, “array”, which means amazon_matrix is a matrix, i.e. a two-dimensional array.
I ran the str() function to check the structure of the matrix and typeof() to check its type. The output is chr, which means the matrix is now a “character” matrix. This is because all elements of a matrix must be of the same type, and all of its columns must have the same length. Since amazon_df_sub contains both “numeric” columns and factors, R coerced every element to a single common type when as.matrix() was applied. Factor values are converted to their labels (strings) and “character” is more flexible than “numeric”, so the type of the matrix is “character”. The equal-length condition is not an issue here because a data frame is already a list of equal-length vectors, so the dimensions (10 rows, 7 columns) are unchanged by the conversion.
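The coercion can be reproduced on a two-row toy data frame (the name `toy` and the choice of columns are assumptions for illustration). The sketch also shows two ways to keep a numeric matrix: selecting only numeric columns, or using data.matrix(), which replaces factors with their integer codes.

```r
# Toy data frame with one numeric column and one factor
toy <- data.frame(
  Price = c(8, 22),
  Genre = factor(c("Non Fiction", "Fiction"))
)

m_chr <- as.matrix(toy)
typeof(m_chr)        # "character": mixed columns are coerced to strings

m_num <- as.matrix(toy["Price"])
typeof(m_num)        # "double": only numeric columns, so no coercion to strings

m_codes <- data.matrix(toy)
typeof(m_codes)      # "double": factors are stored as their level codes
m_codes[, "Genre"]   # 2 1, since the levels sort as "Fiction" < "Non Fiction"
```

data.matrix() keeps the numbers usable but loses the genre labels, so it only suits cases where the codes themselves are meaningful.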
df <- data.frame(matrix(0, ncol = 2, nrow = 10)) #Create a placeholder data frame with 10 rows and 2 columns, filled with zeros
set.seed(1) #Generate an integer variable: Age
Age <- sample(20L:35L, 10, replace=TRUE)
set.seed(1) #Generate an ordinal variable: Final_Assignment_Grade & order the variable
Final_Assignment_Grade <- factor(sample(c("0-49","50-59","60-69","70-79","80-100"),10, replace=TRUE), levels=c("0-49","50-59","60-69","70-79","80-100"), labels=c("Fail","Pass","Credit","Distinction","High Distinction"), ordered=TRUE)
df <- data.frame(Age,Final_Assignment_Grade) #Place two of the variables I have created into my dataframe: df
df$Age #View 10 observations in Age
## [1] 28 23 26 20 21 32 26 30 33 21
df$Final_Assignment_Grade #View the 10 observations in Final_Assignment_Grade and ordered levels
## [1] Fail Distinction Fail Pass
## [5] High Distinction Credit Pass Credit
## [9] Credit Fail
## Levels: Fail < Pass < Credit < Distinction < High Distinction
set.seed(1) #Generate a numeric variable: Final_Exam_GPA
gpa <- seq(from=0, to=4.0, by=.1)
Final_Exam_GPA <- sample(gpa, size=10, replace=TRUE)
df <- cbind(df,Final_Exam_GPA) #Bind the numeric variable: Final_Exam_GPA to my df using cbind()
df$Final_Exam_GPA #View the 10 observations in Final_Exam_GPA
## [1] 0.3 3.8 0.0 3.3 2.2 1.3 1.7 3.2 2.0 2.0
rownames(df) <- c("Student 1","Student 2","Student 3","Student 4","Student 5","Student 6","Student 7","Student 8","Student 9","Student 10") #Rename the row indexes of df
df %>% tail(3) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed") #View the last 3 rows of the df with all 3 variables
| | Age | Final_Assignment_Grade | Final_Exam_GPA |
|---|---|---|---|
| Student 8 | 30 | Credit | 3.2 |
| Student 9 | 33 | Credit | 2.0 |
| Student 10 | 21 | Fail | 2.0 |
str(df) #Check the final structure all the variables in df
## 'data.frame': 10 obs. of 3 variables:
## $ Age : int 28 23 26 20 21 32 26 30 33 21
## $ Final_Assignment_Grade: Ord.factor w/ 5 levels "Fail"<"Pass"<..: 1 4 1 2 5 3 2 3 3 1
## $ Final_Exam_GPA : num 0.3 3.8 0 3.3 2.2 1.3 1.7 3.2 2 2
I started by creating a placeholder data frame with 10 rows and 2 columns filled with zeros: df (this step is not necessary, because df is overwritten later by data.frame()). The final data frame has 10 observations, one for each of 10 individual students.
Note: Instead of setting the values for each variable myself, I am asking R to generate 10 random values for all the variables in df.
For all the variables created below, I used the set.seed() function so that the sequence of random numbers can be reproduced exactly when the code is run again.
I have set replace=TRUE to allow R to repeat values within each random sample generated.
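A minimal sketch of the reproducibility that set.seed() provides, using the same call as for Age (the names `first_draw` and `second_draw` are assumptions for illustration):

```r
# Resetting the seed before each call reproduces the same random sample
set.seed(1)
first_draw <- sample(20L:35L, 10, replace = TRUE)

set.seed(1)
second_draw <- sample(20L:35L, 10, replace = TRUE)

identical(first_draw, second_draw)  # TRUE: same seed, same sequence
```

The exact numbers drawn can differ between R versions, but within one session the two draws are always identical.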
Firstly, I created an integer variable: Age, which stands for each student’s age. Here I have set the range of Age to be 20 to 35 (a common age range for university students). I added “L” after the numbers to state explicitly that the values are integers, even though “20:35” would also give an integer vector.
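The effect of the “L” suffix can be checked directly. This short sketch (variable names are illustrative) shows that the colon operator already returns integers for whole-number endpoints, whereas a bare number on its own is stored as a double.

```r
with_suffix    <- 20L:35L
without_suffix <- 20:35

is.integer(with_suffix)     # TRUE
is.integer(without_suffix)  # TRUE: the colon operator returns integers here
typeof(20)                  # "double": a bare number is a double
typeof(20L)                 # "integer"
```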
Secondly, I created an ordinal variable: Final_Assignment_Grade, which stands for each student’s final assignment grade. I factorized the variable using factor(); its levels consist of “0-49”,“50-59”,“60-69”,“70-79”,“80-100”, which are the mark ranges for the students’ final assignment. I labelled these mark ranges as grades so that they are easier to interpret: “Fail” for “0-49”, “Pass” for “50-59”, “Credit” for “60-69”, “Distinction” for “70-79” and “High Distinction” for “80-100”. I have also ordered the grades: “Fail”<“Pass”<“Credit”<“Distinction”<“High Distinction”. Final_Assignment_Grade is an ordered factor with 5 levels, from “Fail”, the lowest grade, to “High Distinction”, the highest grade.
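How levels and labels pair up can be seen on three hand-picked mark ranges instead of a random sample (the vector `marks` and the name `grade` are assumptions for illustration):

```r
# Three hand-picked mark ranges (illustrative, not randomly sampled)
marks <- c("50-59", "0-49", "80-100")

grade <- factor(marks,
                levels = c("0-49", "50-59", "60-69", "70-79", "80-100"),
                labels = c("Fail", "Pass", "Credit", "Distinction",
                           "High Distinction"),
                ordered = TRUE)

as.character(grade)  # "Pass" "Fail" "High Distinction"
grade >= "Pass"      # TRUE FALSE TRUE: comparisons follow the level order
nlevels(grade)       # 5: unused levels such as "Credit" are still kept
```

Each label replaces the level in the same position, so the two vectors must be written in the same order.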
Then, I put both variables “Age” and “Final_Assignment_Grade” into df using the data.frame() function.
Thirdly, I created a numerical variable: Final_Exam_GPA, which stands for each student’s Final Exam GPA. I set the random values to range from 0 to 4.0 in increments of 0.1 (which is typically the case for GPA). It is a numerical variable rather than an integer variable because GPA values have one decimal place and are not whole numbers.
Lastly, I used the cbind() function to add the numerical variable Final_Exam_GPA to my data frame: df. I renamed the row indexes of df to “Student 1” through “Student 10”. str(df) shows that df now consists of 10 observations and 3 variables: Age (integer), Final_Assignment_Grade (ordered factor) and Final_Exam_GPA (numeric).
Saalu, S. (2020). Amazon Top 50 Bestselling Books 2009 - 2019. https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019
Taheri, S. (2021). Module 2 Get: Importing, Scraping and Exporting Data with R [Module Webpage]. Canvas @ RMIT University. http://rare-phoenix-161610.appspot.com/secured/Module_02.html
Taheri, S. (2021). Module 3 Understand: Understanding Data and Data Structures [Module Webpage]. Canvas @ RMIT University. http://rare-phoenix-161610.appspot.com/secured/Module_03.html