Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
# This is the R chunk for the required packages
library(outliers)
## Warning: package 'outliers' was built under R version 4.0.3
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(stringr)
## Warning: package 'stringr' was built under R version 4.0.3
library(editrules)
## Warning: package 'editrules' was built under R version 4.0.3
## Loading required package: igraph
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:lubridate':
##
## %--%, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
##
## Attaching package: 'editrules'
## The following objects are masked from 'package:igraph':
##
## blocks, normalize
library(ggplot2)
library(car)
## Loading required package: carData
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:editrules':
##
## contains
## The following objects are masked from 'package:igraph':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following objects are masked from 'package:editrules':
##
## contains, separate
## The following object is masked from 'package:igraph':
##
## crossing
library(deducorrect)
## Warning: package 'deducorrect' was built under R version 4.0.3
library(validate)
##
## Attaching package: 'validate'
## The following object is masked from 'package:dplyr':
##
## expr
## The following object is masked from 'package:ggplot2':
##
## expr
## The following object is masked from 'package:igraph':
##
## compare
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
The purpose of this investigation was to understand how different reviewer demographics respond to the top 3 most popular anime genres by analysing their user scores. First, I imported datasets from 3 different CSV files. Then I reviewed the merged data frame and removed unneeded observations and variables. After that, the dataset was tidied according to tidy data principles. Finally, missing values and outliers were checked and removed, along with the other processing steps required by the assignment.
This dataset contains information about Anime (16k), Reviews (130k) and Profiles (47k), crawled from the open-source database at https://www.kaggle.com/marlesson/myanimelist-dataset-animes-profiles-reviews?select=animes.csv.
The dataset contains 3 files:
animes.csv contains the list of anime, with title, title synonyms, genre, duration, rank, popularity, score, airing date, episodes and many other attributes of each anime, providing sufficient information to study trends over time. Rank is stored as a float in the CSV, but it contains only integer values. This is due to NaN values and their representation in pandas.
reviews.csv contains the reviews that users wrote for anime (users x animes), with the review text and scores.
profiles.csv contains information about users who watch anime, namely username, birth date, gender, and favorite anime list.
animes <- read_csv("C:/Users/Andrew Liu/OneDrive/3_Edu_Development_Training/Postgraduate RMIT/MATH 2349 Data Wrangling/Assignment 3/Anime/animes.csv");
## Parsed with column specification:
## cols(
## uid = col_double(),
## title = col_character(),
## synopsis = col_character(),
## genre = col_character(),
## aired = col_character(),
## episodes = col_double(),
## members = col_double(),
## popularity = col_double(),
## ranked = col_double(),
## score = col_double(),
## img_url = col_character(),
## link = col_character()
## )
#According to the data description, the "uid" column in animes is the same as the "anime_uid" column in the reviews dataset, so the column in animes is renamed to match the one in reviews and serve as a common key.
animes <- rename(animes,anime_uid=uid);
#On inspection, numerous duplicates were found in the dataset, so distinct() is used to remove them.
animes<-animes %>% distinct(anime_uid, .keep_all= TRUE)
reviews <- read_csv("C:/Users/Andrew Liu/OneDrive/3_Edu_Development_Training/Postgraduate RMIT/MATH 2349 Data Wrangling/Assignment 3/Anime/reviews.csv");
## Parsed with column specification:
## cols(
## uid = col_double(),
## profile = col_character(),
## anime_uid = col_double(),
## text = col_character(),
## score = col_double(),
## scores = col_character(),
## link = col_character()
## )
reviews<-reviews %>% distinct(uid, .keep_all= TRUE)
profiles <- read_csv("C:/Users/Andrew Liu/OneDrive/3_Edu_Development_Training/Postgraduate RMIT/MATH 2349 Data Wrangling/Assignment 3/Anime/profiles.csv");
## Parsed with column specification:
## cols(
## profile = col_character(),
## gender = col_character(),
## birthday = col_character(),
## favorites_anime = col_character(),
## link = col_character()
## )
#Datasets Display
#dim() is used to get the dimensions of the imported datasets, and head() to display the first few rows of each.
dim(animes)
## [1] 16216 12
head(animes,3)
dim(profiles)
## [1] 81727 5
head(profiles,3)
dim(reviews)
## [1] 130519 7
head(reviews,3)
#Merging three Datasets
#The reviews and animes datasets are first joined with inner_join() on the common key "anime_uid". The third dataset, profiles, is keyed by the profile name and is left-joined to the newly merged data frame. distinct() is then applied again to remove the duplicates created by the left join.
data1<-inner_join(animes, reviews, by = "anime_uid");
data1<-left_join(data1, profiles, by = "profile");
data1<-data1 %>% distinct(uid, .keep_all= TRUE)
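The join sequence above can be illustrated on toy data. This is only a minimal sketch: the three small tibbles and their values are invented for illustration, and only the column names (anime_uid, uid, profile) follow the datasets in this report.

```r
library(dplyr)

# Toy stand-ins for the three datasets (all values are invented)
animes_toy   <- tibble(anime_uid = c(1, 2, 3),
                       genre = c("Action", "Comedy", "Drama"))
reviews_toy  <- tibble(uid = c(10, 11, 12),
                       anime_uid = c(1, 1, 4),
                       profile = c("a", "b", "c"))
# Profile "a" appears twice, mimicking duplicate rows in a profiles file
profiles_toy <- tibble(profile = c("a", "a", "b"),
                       gender = c("Male", "Male", "Female"))

# inner_join keeps only reviews whose anime_uid exists in animes_toy
joined <- inner_join(animes_toy, reviews_toy, by = "anime_uid")

# left_join keeps every review, but a duplicated profile key multiplies rows
joined <- left_join(joined, profiles_toy, by = "profile")
n_before <- nrow(joined)

# distinct() on the review id restores one row per review
joined <- distinct(joined, uid, .keep_all = TRUE)
n_after <- nrow(joined)
```

Comparing n_before and n_after shows how many rows the left join duplicated, which is a quick sanity check before relying on distinct().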
#data1 contains many unused variables, so it is subset to the columns of interest and overwritten.
data1<-select(data1, anime_uid,genre,birthday, uid, score.y, gender);
#Basic tidying was performed to ensure each attribute is in the correct format. Observations with missing gender or birthday were removed, along with rows whose gender is "Non-Binary", and the data frame was saved as tidy1 for further investigation and manipulation.
tidy1<-filter(data1, !is.na(gender), !is.na(birthday),gender!="Non-Binary");
tidy1$anime_uid<- tidy1$anime_uid%>% as.character()
tidy1$uid<- tidy1$uid%>% as.character()
tidy1$gender<-tidy1$gender %>% as.factor();
# This is the R chunk for the Tidy & Manipulate Data I
#On inspection of the data frame, the birthdays were recorded in irregular, untidy formats. Since the purpose was to investigate the reviewer demographics of the top 3 most popular anime genres, year of birth is more suitable. Some observations contained invalid values, so 387 failed to parse. These entries are removed in the following steps.
tidy1$birthday <- year(parse_date_time(tidy1$birthday, orders = c("mdy","y", "my","ym", "dmy")));
## Warning: 387 failed to parse.
tidy1<-rename(tidy1, birthyear=birthday);
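The parsing step above can be tried on a handful of strings. The sketch below assumes a few invented examples of irregular birthday entries; the real column may contain other formats, so the list of orders would need to be adjusted accordingly.

```r
library(lubridate)

# Invented examples of irregular entries in a free-text birthday field
birthday <- c("Oct 2, 1994", "1990", "Sep 1988", "not a date")

# Each string is matched against the candidate orders;
# entries that match none of them become NA and trigger a warning
parsed <- parse_date_time(birthday, orders = c("mdy", "y", "my"))

# Keep only the year, as in the report
birthyear <- year(parsed)

# Number of entries that could not be parsed
n_failed <- sum(is.na(birthyear))
```

Counting the NA values straight after parsing makes it easy to confirm that the number of failures matches the warning message.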
# This is the R chunk for the Tidy & Manipulate Data II
# Because the genre column contained multiple values in one cell, it needed to be tidied. First it was separated into 3 columns (taking the first 3 listed genres of each anime; if an anime had fewer than 3 genres, the remaining columns are filled with NA, which will be removed)
tidy2<-tidy1%>% tidyr::separate(genre,into = c("col1","col2","col3"), sep =",")
## Warning: Expected 3 pieces. Additional pieces discarded in 52628 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 8447 rows [1765,
## 1766, 1767, 1768, 1769, 1770, 1771, 1772, 1773, 1774, 1775, 1776, 1777, 1778,
## 1779, 1780, 1781, 1782, 1783, 1784, ...].
tidy3<-select(tidy2, anime_uid, c(2:8))
# This step pivots the 3 genre columns (col1, col2 & col3) into long format, keeping the values in Genre1 and discarding the column-name variable r1.
tidy3<-select(pivot_longer(tidy2,cols = 2:4,names_to = "r1",values_to = "Genre1"), -r1)
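The separate-then-pivot pattern above can be sketched on toy data as follows. The genre strings are invented, and the extra, fill and values_drop_na arguments are choices made for this sketch rather than what the report's code does: they silence the warnings and drop the NA padding at the pivot step.

```r
library(dplyr)
library(tidyr)

# Toy data: one row per review, genres stored as one comma-separated string
toy <- tibble(uid = c("1", "2"),
              genre = c("Action, Comedy, Drama, Fantasy", "Sports"))

long <- toy %>%
  # Keep the first three genres; drop extras and pad short lists with NA
  separate(genre, into = c("col1", "col2", "col3"), sep = ",",
           extra = "drop", fill = "right") %>%
  # Stack the three columns into one; values_drop_na removes the NA padding
  pivot_longer(cols = col1:col3, names_to = "r1", values_to = "Genre1",
               values_drop_na = TRUE) %>%
  # The helper column holding the old column names is not needed
  select(-r1) %>%
  # Remove the leading spaces left over from splitting on ","
  mutate(Genre1 = trimws(Genre1))
```

Dropping the NA padding at this point keeps padded rows from being counted under any genre label later on.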
# Because the data contained punctuation, this step cleans that noise out of Genre1.
tidy3$Genre1<-gsub('[[:punct:] ]+',' ',tidy3$Genre1)
# Because only the most popular (most reviewed) genres are of interest, all other genres are recoded as "Other". Note that the index [1:4] keeps the four most frequent genres rather than three.
tidy3$Genre1[which(!tidy3$Genre1 %in% names(rev(sort(table(tidy3$Genre1)))[1:4]))] <- "Other"
# Genre needs to be assigned as factor.
tidy3$Genre1<-tidy3$Genre1 %>% as.factor();
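The recoding of less frequent genres can be written a little more explicitly in base R. In this sketch the genre vector is invented and n_keep is a hypothetical parameter; NA values are left as NA instead of being folded into "Other".

```r
# Toy genre vector (invented values), including one missing entry
genre <- c("Comedy", "Action", "Comedy", "Drama", "Action", "Comedy",
           "Horror", "Drama", "Sports", NA)

n_keep <- 3  # number of genres to keep; an assumption for this sketch

# table() ignores NA by default; sort descending and take the first n_keep names
top <- names(sort(table(genre), decreasing = TRUE))[seq_len(n_keep)]

# Recode everything outside the top genres, leaving NA untouched
recoded <- ifelse(is.na(genre) | genre %in% top, genre, "Other")
recoded <- factor(recoded)

table(recoded, useNA = "ifany")
```

Keeping NA separate from "Other" makes it possible to remove the missing genres later with na.omit() or a filter.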
# This is the R chunk for the Scan I
# This step looks for obvious errors in the dataset. As the data description states that scores should be in the range 1 to 10, any observation outside this range should be removed. Year of birth obviously cannot be later than 2020, but because this investigation focuses on adult anime viewers, 2002 was set as the upper threshold.
summary(tidy3)
## anime_uid birthyear uid score.y
## Length:225219 Min. :1930 Length:225219 Min. : 0.000
## Class :character 1st Qu.:1991 Class :character 1st Qu.: 6.000
## Mode :character Median :1995 Mode :character Median : 8.000
## Mean :1996 Mean : 7.428
## 3rd Qu.:1999 3rd Qu.: 9.000
## Max. :2059 Max. :11.000
## NA's :1161
## gender Genre1
## Female: 57168 Action : 30152
## Male :168051 Adventure : 14596
## Comedy : 30402
## Drama : 16345
## Other :133724
##
##
# Need to fix birthyear and score
Rules <- editfile("Editrules.txt", type = "all")
V_t3<-violatedEdits(Rules,tidy3)
summary(V_t3)
## Edit violations, 225219 observations, 0 completely missing (0%):
##
## editname freq rel
## num2 36405 16.2%
## num1 1716 0.8%
## num3 3 0%
## num4 3 0%
##
## Edit violations per record:
##
## errors freq rel
## 0 185931 82.6%
## 1 38127 16.9%
## 2 1161 0.5%
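A similar rule check can be sketched with the validate package, which is already loaded in this report. The contents of Editrules.txt are not shown here, so the four rules below are an assumption based on the thresholds used in the manual correction step, and the toy data is invented.

```r
library(validate)

# Toy data (invented values) with one bad birth year, one bad score and one NA
toy <- data.frame(birthyear = c(1995, 2059, 1988, NA),
                  score.y   = c(8, 7, 11, 5))

# Assumed rules, taken from the thresholds used in the manual correction
rules <- validator(birthyear >= 1960,
                   birthyear <= 2002,
                   score.y > 0,
                   score.y <= 10)

# confront() evaluates every rule against every record
checked <- confront(toy, rules)

# One row per rule, with counts of passes, fails and NA results
res <- summary(checked)
res[, c("name", "passes", "fails", "nNA")]
```

The fails and nNA columns separate true rule violations from records that could not be evaluated because of missing values.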
Rules1 <- correctionRules('Editrules.txt')
#correctWithRules() caused RStudio to freeze, so the correction was done manually instead.
tidy4<-tidy3[(tidy3$birthyear <= 2002) & (tidy3$birthyear >= 1960),]
tidy5<-tidy4[(tidy4$score.y <= 10) & (tidy4$score.y > 0),]
tidy6<-na.omit(tidy5)
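The manual correction relies on na.omit() to clean up after bracket subsetting, because a logical index that contains NA returns a row of NAs. The sketch below, on invented toy data, shows that behaviour next to a dplyr::filter() alternative that drops NA conditions directly.

```r
library(dplyr)

# Toy data (invented values): one NA birth year and several out-of-range values
toy <- data.frame(birthyear = c(1995, 2059, NA, 1950, 1988),
                  score.y   = c(8, 7, 9, 6, 0))

# Base subsetting keeps an all-NA row wherever the condition evaluates to NA
base_rows <- toy[toy$birthyear <= 2002 & toy$birthyear >= 1960, ]

# filter() keeps only rows where every condition is TRUE,
# so rows with NA conditions are dropped without a separate na.omit()
clean <- toy %>%
  filter(between(birthyear, 1960, 2002),
         score.y > 0, score.y <= 10)
```

Both approaches end up with the same valid rows once na.omit() is applied to the base result; the filter() version simply does it in one step.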
# This is the R chunk for the Scan II
# Checking for outliers using boxplot(). As the figures below show, the user score plot has one likely outlier. However, since some reviewers do genuinely give a score of 1, it was not removed. The reviewer demographic plot, on the other hand, clearly shows many outliers.
par(mfrow=c(1,2))
tidy6$birthyear%>% boxplot(main='Reviews Demographic')
tidy6$score.y%>% boxplot(main='User score')
# Outliers in the reviewer demographic were flagged visually with the boxplot (Tukey's method) and then identified and removed using z-scores.
z.scores <- tidy6$birthyear %>% scores(type = "z")
z.scores %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.9244 -0.5808 0.1317 0.0000 0.6661 1.5567
length (which( abs(z.scores) >3 ))
## [1] 3216
# subset values that have absolute value of z-scores less than 3
tidy7<-subset(tidy6,subset = abs(z.scores) <3 )
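For reference, the z-score used above is the distance of each value from the mean in units of standard deviation. A minimal base R sketch on invented birth years, using the same cut-off of 3:

```r
# Toy birth years (invented values) with one implausible entry
birthyear <- c(rep(1990:1999, each = 3), 1930)

# z-score: distance from the mean in units of standard deviation
z <- (birthyear - mean(birthyear)) / sd(birthyear)

# Flag values more than 3 standard deviations from the mean
is_outlier <- abs(z) > 3

# Keep everything that was not flagged
kept <- birthyear[!is_outlier]
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule works best when the outliers are few compared with the size of the sample.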
# Most outliers that may be due to input errors have been removed.
par(mfrow=c(1,2))
tidy7$birthyear%>% boxplot(main='Reviews Demographic')
tidy7$score.y%>% boxplot(main='User score')
# This is the R chunk for the Transform Section
# Checking for normality using hist().
par(mfrow=c(1,2))
tidy7$birthyear %>% hist(main='Demographic distr.', xlab = 'Year of birth')
tidy7$score.y %>% hist(main='User score distr.', xlab = 'Score 1 to 10')
plot(tidy7$score.y,tidy7$birthyear)
# Unfortunately there is no visible relationship between user score and birthyear
par(mfrow=c(1,2))
birthyear1 <- BoxCox(tidy7$birthyear, lambda = 'auto')
birthyear1 %>% hist(main='Transformed Demographic distr.')
score.y1 <- BoxCox(tidy7$score.y, lambda = 'auto')
score.y1 %>% hist(main='Transformed User score distr.')
# As the two transformed histograms below show, the Box-Cox transformation reduced the left skewness in the user score distribution, but it was less successful for the demographic distribution. For statistical investigations, a square transformation could be applied to further normalise the demographic distribution. However, due to limited computing power (the Box-Cox computation took about 3 hours), this was not tested on this data in this assignment.
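For a fixed lambda, the Box-Cox transformation is a simple formula and is cheap to compute. The sketch below is a hand-written version for strictly positive values, not the forecast package's implementation; box_cox is a hypothetical helper name and the scores are invented. With lambda = 2 it corresponds to the square transformation mentioned above, up to a shift and scale.

```r
# Box-Cox transform for strictly positive x and a fixed lambda:
#   (x^lambda - 1) / lambda  when lambda != 0
#   log(x)                   when lambda == 0
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Toy left-skewed scores (invented values)
score <- c(1, 5, 7, 8, 8, 9, 9, 10, 10, 10)

# lambda = 2 stretches the upper end, which reduces left skew
squared <- box_cox(score, lambda = 2)

# lambda = 0 is the log transform, which would increase left skew instead
logged <- box_cox(score, lambda = 0)
```

Trying a small set of fixed lambda values in this way avoids the automatic lambda search, which may be where most of the computing time is spent.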