options(scipen = 10)
library(ggplot2)
library(tidyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
library(car)
## Loading required package: carData
hmdata <- read.csv("~/IMB/R/HW1 dataset.csv")
head(hmdata)
## bookID
## 1 1
## 2 2
## 3 4
## 4 5
## 5 8
## 6 9
## title
## 1 Harry Potter and the Half-Blood Prince (Harry Potter #6)
## 2 Harry Potter and the Order of the Phoenix (Harry Potter #5)
## 3 Harry Potter and the Chamber of Secrets (Harry Potter #2)
## 4 Harry Potter and the Prisoner of Azkaban (Harry Potter #3)
## 5 Harry Potter Boxed Set Books 1-5 (Harry Potter #1-5)
## 6 Unauthorized Harry Potter Book Seven News: "Half-Blood Prince" Analysis and Speculation
## authors average_rating isbn isbn13
## 1 J.K. Rowling/Mary GrandPré 4.57 439785960 9780440000000
## 2 J.K. Rowling/Mary GrandPré 4.49 439358078 9780440000000
## 3 J.K. Rowling 4.42 439554896 9780440000000
## 4 J.K. Rowling/Mary GrandPré 4.56 043965548X 9780440000000
## 5 J.K. Rowling/Mary GrandPré 4.78 439682584 9780440000000
## 6 W. Frederick Zimmerman 3.74 976540606 9780980000000
## language_code num_pages ratings_count text_reviews_count publication_date
## 1 eng 652 2095690 27591 9/16/2006
## 2 eng 870 2153167 29221 9/1/2004
## 3 eng 352 6333 244 11/1/2003
## 4 eng 435 2339585 36325 5/1/2004
## 5 eng 2690 41428 164 9/13/2004
## 6 en-US 152 19 1 4/26/2005
## publisher
## 1 Scholastic Inc.
## 2 Scholastic Inc.
## 3 Scholastic
## 4 Scholastic Inc.
## 5 Scholastic
## 6 Nimble Books
The unit of observation in this data set is a published book, and the sample size is 11,123 books. The numeric (ratio) variables are the average rating, the number of pages, the number of ratings, and the number of text reviews; the language a book was published in is a nominal variable.
Description of variables:
This dataset was obtained from the Goodreads API (https://www.goodreads.com/api), which was last updated and publicly available in December 2020.
The aim of this data analysis is to find out whether the attributes of a book are related to the rating readers give it.
Research questions are the following: What is the relationship between the average ratings and the number of pages? What is the difference between the average ratings of books in English and non-English languages?
Removing variables that are irrelevant for this data analysis:
hmdata <- subset(hmdata, select = -c(isbn, isbn13, publisher, publication_date))
Searching for potential missing values that could affect later analysis:
colSums(is.na(hmdata))
## bookID title authors average_rating
## 0 0 0 0
## language_code num_pages ratings_count text_reviews_count
## 0 0 0 0
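`is.na()` only flags explicit `NA` entries. A value of 0 in a column such as `num_pages` may also stand in for a missing entry; that is an assumption about how the data were recorded, not something the source confirms. A minimal sketch on a made-up data frame (the name `toy` and its values are hypothetical) counts both kinds:

```r
# Toy data frame standing in for hmdata (values are made up for illustration)
toy <- data.frame(
  average_rating = c(4.57, 0, 3.74, NA),
  num_pages      = c(652, 0, 152, 300)
)

# Explicit missing values per column
na_counts <- colSums(is.na(toy))

# Zeros per column; na.rm = TRUE keeps NA entries from turning a count into NA
zero_counts <- colSums(toy == 0, na.rm = TRUE)

print(na_counts)
print(zero_counts)
```

Whether zeros should be treated as missing depends on the variable: a book with 0 text reviews is plausible, while a book with 0 pages probably is not.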
Establishing two categories for the nominal variable that records the language of the books:
hmdata$language_code <- ifelse(hmdata$language_code %in% c("eng", "en-US", "en-CA", "en-GB"), "eng", "non eng")
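One way to confirm that the recoding grouped the codes as intended is to cross-tabulate the original codes against the new groups. The sketch below applies the same four English codes to a made-up vector; `original`, `recoded` and `check` are illustrative names, not objects from the analysis:

```r
# Toy language codes (made up); the real column contains more distinct values
original <- c("eng", "en-US", "spa", "en-GB", "fre", "en-CA")

# The four English codes become "eng", everything else becomes "non eng"
recoded <- ifelse(original %in% c("eng", "en-US", "en-CA", "en-GB"),
                  "eng", "non eng")

# Rows: original codes, columns: new groups
# Each row should have exactly one non-zero cell
check <- table(original, recoded)
print(check)
```

On the real data this requires keeping a copy of the original column before it is overwritten.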
Due to the large data set, the str() function was used to display the internal structure:
str(hmdata)
## 'data.frame': 11123 obs. of 8 variables:
## $ bookID : int 1 2 4 5 8 9 10 12 13 14 ...
## $ title : chr "Harry Potter and the Half-Blood Prince (Harry Potter #6)" "Harry Potter and the Order of the Phoenix (Harry Potter #5)" "Harry Potter and the Chamber of Secrets (Harry Potter #2)" "Harry Potter and the Prisoner of Azkaban (Harry Potter #3)" ...
## $ authors : chr "J.K. Rowling/Mary GrandPré" "J.K. Rowling/Mary GrandPré" "J.K. Rowling" "J.K. Rowling/Mary GrandPré" ...
## $ average_rating : num 4.57 4.49 4.42 4.56 4.78 3.74 4.73 4.38 4.38 4.22 ...
## $ language_code : chr "eng" "eng" "eng" "eng" ...
## $ num_pages : int 652 870 352 435 2690 152 3342 815 815 215 ...
## $ ratings_count : int 2095690 2153167 6333 2339585 41428 19 28242 3628 249558 4930 ...
## $ text_reviews_count: int 27591 29221 244 36325 164 1 808 254 4080 460 ...
Statistical description of numerical variables rounded to 2 decimals:
round(stat.desc(hmdata[c(4, 6, 7, 8)]), 2)
## average_rating num_pages ratings_count text_reviews_count
## nbr.val 11123.00 11123.00 11123.00 11123.00
## nbr.null 25.00 76.00 80.00 624.00
## nbr.na 0.00 0.00 0.00 0.00
## min 0.00 0.00 0.00 0.00
## max 5.00 6576.00 4597666.00 94265.00
## range 5.00 6576.00 4597666.00 94265.00
## sum 43758.72 3741839.00 199578299.00 6029201.00
## median 3.96 299.00 745.00 47.00
## mean 3.93 336.41 17942.85 542.05
## SE.mean 0.00 2.29 1066.69 24.43
## CI.mean.0.95 0.01 4.48 2090.90 47.89
## var 0.12 58154.59 12656059531.66 6638968.51
## std.dev 0.35 241.15 112499.15 2576.62
## coef.var 0.09 0.72 6.27 4.75
Interpretation: The mean of the average ratings was 3.93, and its standard error was close to zero, meaning that the sample mean is a precise estimate of the population mean. Text reviews accounted for only about 3% of all ratings (text_reviews_count relative to ratings_count), meaning that most readers give a score without a written comment. The high coefficient of variation and standard deviation of the ratings count and text reviews count are expected, as the popularity and awareness of specific books vary widely. This is also shown by the large range. However, it can present an issue when looking for relationships between the variables.
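The share of text reviews and the coefficients of variation can be recomputed directly from the summary table. A short sketch, using only figures copied from the `stat.desc()` output above (the variable names are illustrative):

```r
# Sums copied from the stat.desc() output above
sum_ratings      <- 199578299
sum_text_reviews <- 6029201

# Share of ratings that come with a written review
text_share <- sum_text_reviews / sum_ratings
print(round(text_share, 3))

# Coefficient of variation = standard deviation / mean
cv_ratings      <- 112499.15 / 17942.85
cv_text_reviews <- 2576.62 / 542.05
print(round(c(ratings_count = cv_ratings,
              text_reviews_count = cv_text_reviews), 2))
```

A coefficient of variation well above 1 means the standard deviation is several times the mean, which is typical of heavily right-skewed count data.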
Visualisation of the distribution of average ratings:
ggplot(hmdata, aes(x = average_rating)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 0.09, colour = "black", fill = "rosybrown1") +
  ggtitle("The distribution of average ratings") +
  # linewidth replaces lwd/size for lines since ggplot2 3.4.0
  geom_density(linewidth = 1,
               linetype = 1,
               colour = "blue")
The plot above shows that most average ratings lie approximately between 3.5 and 4.5, while ratings near 1 and 5 are less common. The distribution is left-skewed.
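The direction of skew can also be checked numerically. A left-skewed variable tends to have a mean below its median (3.93 versus 3.96 in the summary table) and a negative moment-based skewness. The sketch below uses a made-up vector, and `skewness_moment` is a hypothetical helper, not a function from the loaded packages:

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

# Toy ratings with a long left tail (made up for illustration)
toy_ratings <- c(0, 2.5, 3.6, 3.8, 3.9, 4.0, 4.0, 4.1, 4.2, 4.4)

print(mean(toy_ratings) < median(toy_ratings))
print(skewness_moment(toy_ratings))
```

Applied to the real data, the same helper could be called as `skewness_moment(hmdata$average_rating)`.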
Scatter plot to display the relationship between the number of pages and average rating:
scatterplot(y = hmdata$num_pages, x = hmdata$average_rating,
main = "The relationship between the number of pages and average rating",
ylab = "Number of pages per book",
xlab = "Average rating",
smooth = FALSE)
Most observations are concentrated among books of no more than 1,000 pages. Because of the outliers above 1,000 pages, the pattern is not conclusive. However, based on the marginal box plots, most of the highest ratings went to books with fewer than 500 pages.
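Because the scatter plot is dominated by outliers, a rank-based correlation is one possible way to quantify the relationship; choosing Spearman's rho here is an assumption, not a method used in the analysis above. The sketch uses eight of the first ten rows shown in the `str()` output, which are not representative of the full data set:

```r
# Eight rows taken from the str() output above (illustration only)
toy <- data.frame(
  num_pages      = c(152, 215, 352, 435, 652, 815, 870, 2690),
  average_rating = c(3.74, 4.22, 4.42, 4.56, 4.57, 4.38, 4.49, 4.78)
)

# Rank-based correlation; exact = FALSE requests the asymptotic p-value,
# which avoids a warning when ranks are tied (likely in the full data)
res <- cor.test(toy$num_pages, toy$average_rating,
                method = "spearman", exact = FALSE)

print(res$estimate)  # Spearman's rho, between -1 and 1
print(res$p.value)
```

On the full data the call would use `hmdata$num_pages` and `hmdata$average_rating` instead of the toy columns.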
Box plot showing the average rating of books by language:
ggplot(hmdata, aes(as.factor(language_code), average_rating))+
geom_boxplot()+
ggtitle("Average rating of books based on language")+
xlab("Language")+
theme(plot.title = element_text(hjust = 0.5))
The plot above shows how average ratings are distributed for books in English and in other languages. The median is slightly lower for the non-English books, while the English books have more outliers that deviate from the median, with average ratings between 0 and 3. It is worth mentioning that the number of ratings for books in English is considerably higher than for those that are not.
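The box plot can be complemented with group sizes, group medians and a formal comparison. The sketch below runs on made-up data and uses a Welch two-sample t-test, which is one possible choice and an assumption here, not the method of the original analysis:

```r
# Toy data standing in for hmdata (values are made up for illustration)
toy <- data.frame(
  language_code  = c("eng", "eng", "eng", "eng", "eng",
                     "non eng", "non eng", "non eng"),
  average_rating = c(4.57, 4.49, 3.10, 4.56, 3.74, 3.90, 4.05, 3.85)
)

# Group sizes and medians, to put the box plot in context
group_sizes <- table(toy$language_code)
print(group_sizes)
print(aggregate(average_rating ~ language_code, data = toy, FUN = median))

# Welch two-sample t-test (the default; does not assume equal variances)
res <- t.test(average_rating ~ language_code, data = toy)
print(res$p.value)
```

With very unequal group sizes, as in the real data, reporting the group counts alongside the test result helps the reader judge how much weight the smaller group can bear.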