Homework 1

options(scipen = 10)
library(ggplot2)
library(tidyr)
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(pastecs)

## 
## Attaching package: 'pastecs'

## The following object is masked from 'package:tidyr':
## 
##     extract

library(car)

## Loading required package: carData

hmdata <- read.csv("~/IMB/R/HW1 dataset.csv")
head(hmdata)

##   bookID
## 1      1
## 2      2
## 3      4
## 4      5
## 5      8
## 6      9
##                                                                                     title
## 1                               Harry Potter and the Half-Blood Prince (Harry Potter  #6)
## 2                            Harry Potter and the Order of the Phoenix (Harry Potter  #5)
## 3                              Harry Potter and the Chamber of Secrets (Harry Potter  #2)
## 4                             Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)
## 5                                  Harry Potter Boxed Set  Books 1-5 (Harry Potter  #1-5)
## 6 Unauthorized Harry Potter Book Seven News: "Half-Blood Prince" Analysis and Speculation
##                      authors average_rating       isbn        isbn13
## 1 J.K. Rowling/Mary GrandPré           4.57  439785960 9780440000000
## 2 J.K. Rowling/Mary GrandPré           4.49  439358078 9780440000000
## 3               J.K. Rowling           4.42  439554896 9780440000000
## 4 J.K. Rowling/Mary GrandPré           4.56 043965548X 9780440000000
## 5 J.K. Rowling/Mary GrandPré           4.78  439682584 9780440000000
## 6     W. Frederick Zimmerman           3.74  976540606 9780980000000
##   language_code num_pages ratings_count text_reviews_count publication_date
## 1           eng       652       2095690              27591        9/16/2006
## 2           eng       870       2153167              29221         9/1/2004
## 3           eng       352          6333                244        11/1/2003
## 4           eng       435       2339585              36325         5/1/2004
## 5           eng      2690         41428                164        9/13/2004
## 6         en-US       152            19                  1        4/26/2005
##         publisher
## 1 Scholastic Inc.
## 2 Scholastic Inc.
## 3      Scholastic
## 4 Scholastic Inc.
## 5      Scholastic
## 6    Nimble Books

Data description

The unit of observation in this data set is a published book, and the sample size is 11123 books. The numeric (ratio) variables describe the ratings and reviews left by readers and the number of pages, while the language a book was published in is a nominal variable.

Description of variables:

bookID: unique ID for each book/series
title: the title of the book
authors: the author(s) of the book
average_rating: the average rating of the book, as given by users
isbn: the 10-digit ISBN, which identifies the book
isbn13: the 13-digit ISBN format, introduced in 2007
language_code: the language the book was written in
num_pages: the number of pages of the book
ratings_count: the number of ratings given for the book
text_reviews_count: the number of text reviews left by users
publication_date: the date the book was published
publisher: the publisher's name

This data set was obtained from the Goodreads API (https://www.goodreads.com/api), which was last updated and available to the public in December 2020.

Purpose of the data analysis

The aim of this data analysis is to find out whether the attributes of a book are related to the rating readers give it.

Research questions are the following: What is the relationship between the average ratings and the number of pages? What is the difference between the average ratings of books in English and non-English languages?

Data manipulations

Removing variables that are irrelevant for this data analysis.

hmdata <- subset(hmdata, select = -c(isbn, isbn13, publisher, publication_date))

Searching for potential missing values that could affect later analysis:

colSums(is.na(hmdata))

##             bookID              title            authors     average_rating 
##                  0                  0                  0                  0 
##      language_code          num_pages      ratings_count text_reviews_count 
##                  0                  0                  0                  0

Establishing two categories for the nominal variable that describes the language of the books:

hmdata$language_code <- ifelse(hmdata$language_code %in% c("eng", "en-US", "en-CA", "en-GB"), "eng", "non eng")
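One way to confirm that the recoding groups the codes as intended is to cross-tabulate the original values against the recoded ones. The sketch below is a minimal illustration on a small made-up vector (`codes` is hypothetical, not the homework data) and uses `%in%` to test membership in the set of English codes.

```r
# Hypothetical language codes, standing in for hmdata$language_code
codes <- c("eng", "en-US", "fre", "en-GB", "spa", "en-CA", "ger")

# Codes treated as English in the recoding step
english_codes <- c("eng", "en-US", "en-CA", "en-GB")

# Collapse to two categories
recoded <- ifelse(codes %in% english_codes, "eng", "non eng")

# Cross-tabulate original against recoded values to check the mapping
print(table(original = codes, recoded = recoded))
```

Every original code should appear in exactly one of the two columns of the resulting table.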

Due to the large data set, the str() function was used to display the internal structure:

str(hmdata)

## 'data.frame':    11123 obs. of  8 variables:
##  $ bookID            : int  1 2 4 5 8 9 10 12 13 14 ...
##  $ title             : chr  "Harry Potter and the Half-Blood Prince (Harry Potter  #6)" "Harry Potter and the Order of the Phoenix (Harry Potter  #5)" "Harry Potter and the Chamber of Secrets (Harry Potter  #2)" "Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)" ...
##  $ authors           : chr  "J.K. Rowling/Mary GrandPré" "J.K. Rowling/Mary GrandPré" "J.K. Rowling" "J.K. Rowling/Mary GrandPré" ...
##  $ average_rating    : num  4.57 4.49 4.42 4.56 4.78 3.74 4.73 4.38 4.38 4.22 ...
##  $ language_code     : chr  "eng" "eng" "eng" "eng" ...
##  $ num_pages         : int  652 870 352 435 2690 152 3342 815 815 215 ...
##  $ ratings_count     : int  2095690 2153167 6333 2339585 41428 19 28242 3628 249558 4930 ...
##  $ text_reviews_count: int  27591 29221 244 36325 164 1 808 254 4080 460 ...

Descriptive statistics and visualisation

Statistical description of numerical variables rounded to 2 decimals:

round(stat.desc(hmdata[c("average_rating", "num_pages", "ratings_count", "text_reviews_count")]), 2)

##              average_rating  num_pages  ratings_count text_reviews_count
## nbr.val            11123.00   11123.00       11123.00           11123.00
## nbr.null              25.00      76.00          80.00             624.00
## nbr.na                 0.00       0.00           0.00               0.00
## min                    0.00       0.00           0.00               0.00
## max                    5.00    6576.00     4597666.00           94265.00
## range                  5.00    6576.00     4597666.00           94265.00
## sum                43758.72 3741839.00   199578299.00         6029201.00
## median                 3.96     299.00         745.00              47.00
## mean                   3.93     336.41       17942.85             542.05
## SE.mean                0.00       2.29        1066.69              24.43
## CI.mean.0.95           0.01       4.48        2090.90              47.89
## var                    0.12   58154.59 12656059531.66         6638968.51
## std.dev                0.35     241.15      112499.15            2576.62
## coef.var               0.09       0.72           6.27               4.75

Interpretation: The mean of the average ratings was 3.93, and its standard error was close to zero, meaning that the mean is estimated precisely. Text reviews amounted to only about 3% of all ratings (6029201 text reviews against 199578299 ratings), meaning that most readers give a score without a written comment. The high coefficient of variation and standard deviation of ratings_count and text_reviews_count are expected, as the popularity and visibility of individual books vary widely. This is also shown by the large ranges. However, such dispersion can be an issue when looking for a relationship between the variables.
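The derived figures in the table can be re-computed by hand from its other rows. The sketch below copies the relevant values from the stat.desc() output (the variable names are only illustrative) and recomputes the share of ratings that come with a text review, plus the standard error and coefficient of variation for num_pages.

```r
# Values copied from the stat.desc() output above
n           <- 11123       # nbr.val
sum_ratings <- 199578299   # sum of ratings_count
sum_reviews <- 6029201     # sum of text_reviews_count
mean_pages  <- 336.41      # mean of num_pages
sd_pages    <- 241.15      # std.dev of num_pages

# Share of ratings accompanied by a text review, in percent (about 3)
review_share_pct <- 100 * sum_reviews / sum_ratings
print(round(review_share_pct, 1))

# Standard error of the mean: sd / sqrt(n), matches SE.mean (2.29)
se_pages <- sd_pages / sqrt(n)
print(round(se_pages, 2))

# Coefficient of variation: sd / mean, matches coef.var (0.72)
cv_pages <- sd_pages / mean_pages
print(round(cv_pages, 2))
```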

Visualisation of the distribution of average ratings:

ggplot(hmdata, aes(x = average_rating)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 0.09, colour = "black", fill = "rosybrown1") +
  ggtitle("The distribution of average ratings") +
  # linewidth sets line thickness (size is deprecated for lines since ggplot2 3.4.0)
  geom_density(linewidth = 1,
               linetype = 1,
               colour = "blue")

The plot above shows that most average ratings lie approximately between 3.5 and 4.5, while ratings close to 5 and 1 are less common. The distribution is left-skewed.
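The left skew can also be checked numerically. In the descriptive table the mean (3.93) lies below the median (3.96), which is consistent with a longer left tail. The sketch below is a minimal illustration, assuming one common moment-based definition of sample skewness. The helper `skewness()` and the vector `toy_ratings` are hypothetical and not part of the homework.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical ratings with a long left tail, for illustration only
toy_ratings <- c(0, 2.5, 3.5, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.5)

# A negative value indicates a left-skewed distribution
print(skewness(toy_ratings))

# For left-skewed data the mean typically lies below the median
print(c(mean = mean(toy_ratings), median = median(toy_ratings)))
```

On the homework data, the same function could be applied to hmdata$average_rating.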

Scatter plot to display the relationship between the number of pages and average rating

scatterplot(y = hmdata$num_pages, x = hmdata$average_rating,
            main = "The relationship between the number of pages and average rating",
            ylab = "Number of pages per book",
            xlab = "Average rating",
            smooth = FALSE)

The ratings are most densely concentrated among books with no more than 1000 pages. Because of the outliers with more than 1000 pages, the plot does not allow a definite conclusion. However, the box plots in the margins suggest that most of the highly rated books have fewer than 500 pages.
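Because a few very long books can dominate a visual impression, a rank-based correlation is one possible numeric complement to the scatter plot. The sketch below is not part of the original analysis. It assumes Spearman's rank correlation as the measure and uses two short made-up vectors (`pages` and `rating` are hypothetical) that include one extreme page count.

```r
# Hypothetical values for illustration only; the last book is an outlier in length
pages  <- c(120, 250, 300, 450, 900, 3000)
rating <- c(3.6, 3.9, 3.8, 4.0, 4.1, 4.3)

# Spearman's rho uses ranks, so the size of the outlier does not matter,
# only its position in the ordering
rho <- cor(pages, rating, method = "spearman")
print(rho)

# Pearson's r, for comparison, is computed on the raw values
r <- cor(pages, rating, method = "pearson")
print(r)
```

On the homework data, the same calls could be made with hmdata$num_pages and hmdata$average_rating.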

Box plot showing the average rating of books by language:

ggplot(hmdata, aes(as.factor(language_code), average_rating))+
  geom_boxplot()+
  ggtitle("Average rating of books based on language")+
  xlab("Language")+
  theme(plot.title = element_text(hjust = 0.5))

The plot above shows how average ratings are distributed for books written in English and in other languages. The median is slightly lower for the non-English books, while the English books have more outliers far below the median, with scores between 0 and 3. It is worth mentioning that the number of ratings for books in English is considerably higher than for those that are not.
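The quantities read off the box plot (group medians) and the remark about rating volume can be tabulated per language group. The sketch below is a minimal illustration on a made-up data frame (`toy` and its values are hypothetical) that reuses the column names of the homework data.

```r
# Hypothetical data with the same column names as hmdata
toy <- data.frame(
  language_code  = c("eng", "eng", "eng", "eng", "non eng", "non eng"),
  average_rating = c(4.2, 3.9, 4.0, 2.5, 3.8, 3.9),
  ratings_count  = c(1200, 300, 45000, 10, 80, 15)
)

# Number of books per language group
print(table(toy$language_code))

# Median average rating per group (the centre line of each box)
med <- aggregate(average_rating ~ language_code, data = toy, FUN = median)
print(med)

# Total number of ratings per group
tot <- aggregate(ratings_count ~ language_code, data = toy, FUN = sum)
print(tot)
```

Replacing `toy` with hmdata would give the corresponding figures for the full data set.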

Hana Murovec

2023-01-05
