MATH2349 Data Wrangling

Submission Steps:

Once you finalise your report, run all R chunks and Preview your notebook in HTML (by clicking Preview). Make sure your code and outputs are visible.

3. Publish the report to RPubs (see here) and enter your report’s RPubs URL into the Website URL tab under Assignment 2 RPubs Link Submission page in Canvas (see instructions file for details) and submit this too. This online version of the report will be used for marking. Failure to submit your link will delay your feedback and risk late penalties.

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages
library(outliers)

## Warning: package 'outliers' was built under R version 4.0.3

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(stringr)

## Warning: package 'stringr' was built under R version 4.0.3

library(editrules)

## Warning: package 'editrules' was built under R version 4.0.3

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:lubridate':
## 
##     %--%, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## 
## Attaching package: 'editrules'

## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize

library(ggplot2)
library(car)

## Loading required package: carData

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following object is masked from 'package:editrules':
## 
##     contains

## The following objects are masked from 'package:igraph':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following objects are masked from 'package:editrules':
## 
##     contains, separate

## The following object is masked from 'package:igraph':
## 
##     crossing

library(deducorrect)

## Warning: package 'deducorrect' was built under R version 4.0.3

library(validate)

## 
## Attaching package: 'validate'

## The following object is masked from 'package:dplyr':
## 
##     expr

## The following object is masked from 'package:ggplot2':
## 
##     expr

## The following object is masked from 'package:igraph':
## 
##     compare

library(forecast)

## Warning: package 'forecast' was built under R version 4.0.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Executive Summary

The purpose of this investigation was to understand how different reviewer demographics respond to the top 3 most popular anime genres by analysing their user scores. First, I imported the data from 3 different csv files. I then reviewed the merged data frame and removed the observations and variables that were not required. After that, the dataset was tidied according to the tidy data principles. Finally, missing values and outliers were checked and removed, together with the other processing steps required by the assignment.

Data

This dataset contains information about Anime (16k), Reviews (130k) and Profiles (47k), crawled and published as an open dataset at https://www.kaggle.com/marlesson/myanimelist-dataset-animes-profiles-reviews?select=animes.csv.

The dataset contains 3 files:

animes.csv contains list of anime, with title, title synonyms, genre, duration, rank, popularity, score, airing date, episodes and many other important data about individual anime providing sufficient information about trends in time about important aspects of anime. Rank is in float format in csv, but it contains only integer value. This is due to NaN values and their representation in pandas.

reviews.csv contains the reviews written by users for each anime, with the review text and scores.

profiles.csv contains information about users who watch anime, namely username, birth date, gender, and favorite animes list.

animes <- read_csv("C:/Users/Andrew Liu/OneDrive/3_Edu_Development_Training/Postgraduate RMIT/MATH 2349 Data Wrangling/Assignment 3/Anime/animes.csv");

## Parsed with column specification:
## cols(
##   uid = col_double(),
##   title = col_character(),
##   synopsis = col_character(),
##   genre = col_character(),
##   aired = col_character(),
##   episodes = col_double(),
##   members = col_double(),
##   popularity = col_double(),
##   ranked = col_double(),
##   score = col_double(),
##   img_url = col_character(),
##   link = col_character()
## )

#According to the data description, the "uid" column in animes is the same as the "anime_uid" column in the reviews dataset, hence the column in animes is renamed to match the one in reviews so that it can serve as a common key.
animes <- rename(animes,anime_uid=uid);
#Based on observation, numerous duplicates were found in the dataset hence use distinct() to remove them.
animes<-animes %>% distinct(anime_uid, .keep_all= TRUE)                     
reviews <- read_csv("C:/Users/Andrew Liu/OneDrive/3_Edu_Development_Training/Postgraduate RMIT/MATH 2349 Data Wrangling/Assignment 3/Anime/reviews.csv");

## Parsed with column specification:
## cols(
##   uid = col_double(),
##   profile = col_character(),
##   anime_uid = col_double(),
##   text = col_character(),
##   score = col_double(),
##   scores = col_character(),
##   link = col_character()
## )

reviews<-reviews %>% distinct(uid, .keep_all= TRUE)

profiles <- read_csv("C:/Users/Andrew Liu/OneDrive/3_Edu_Development_Training/Postgraduate RMIT/MATH 2349 Data Wrangling/Assignment 3/Anime/profiles.csv");

## Parsed with column specification:
## cols(
##   profile = col_character(),
##   gender = col_character(),
##   birthday = col_character(),
##   favorites_anime = col_character(),
##   link = col_character()
## )
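The column specification messages above appear because read_csv() guesses each column type. A minimal sketch of declaring the types explicitly is given below. The two-row file is made-up illustration data written to a temporary path, not the Kaggle file, and the column subset is an assumption.

```r
library(readr)

# Hypothetical two-row extract written to a temporary file (illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("uid,title,score", "1,Example A,8.5", "2,Example B,7.0"), tmp)

# Declaring col_types avoids type guessing and the column specification message
toy_animes <- read_csv(
  tmp,
  col_types = cols(
    uid   = col_double(),
    title = col_character(),
    score = col_double()
  )
)

str(toy_animes)
```

Declaring the types up front also makes an unexpected column type surface at import time rather than later in the wrangling steps.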

#Datasets Display
#dim() was used to get the dimensions of the imported datasets and head() to display the first few rows of each dataset.
dim(animes)

## [1] 16216    12

head(animes,3)

dim(profiles)

## [1] 81727     5

head(profiles,3)

dim(reviews)

## [1] 130519      7

head(reviews,3)

#Merging three Datasets
#The reviews and animes datasets are first joined with inner_join() on the common key “anime_uid”. The third dataset, profiles, is keyed by profile and was left-joined to the newly merged data frame. distinct() was then applied again to remove the duplicates that the left join introduced into data1.

data1<-inner_join(animes, reviews, by = "anime_uid");
data1<-left_join(data1, profiles, by = "profile");
data1<-data1 %>% distinct(uid, .keep_all= TRUE)

#data1 contains many unused variables, hence they are removed by subsetting data1 and overwriting it.
data1<-select(data1, anime_uid,genre,birthday, uid, score.y, gender);
#Simple data tidying was performed to ensure each attribute is in the correct format. Observations with missing values in the variables of interest were removed, and the data frame was saved as tidy1 for further investigation and manipulation.
tidy1<-filter(data1, !is.na(gender), !is.na(birthday),gender!="Non-Binary");
tidy1$anime_uid<- tidy1$anime_uid%>% as.character()
tidy1$uid<- tidy1$uid%>% as.character()
tidy1$gender<-tidy1$gender %>% as.factor();
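The join and de-duplication steps above can be illustrated on a small scale. The sketch below uses made-up toy tables (only the key names anime_uid, uid and profile are taken from the report) to show why a left join against a lookup table with repeated keys duplicates rows, and how distinct() on the review id removes them again.

```r
library(dplyr)

# Made-up toy tables; only the key names follow the report
toy_animes   <- data.frame(anime_uid = c(1, 2, 3), genre = c("A", "B", "C"))
toy_reviews  <- data.frame(uid = c(10, 11, 12),
                           anime_uid = c(1, 1, 4),
                           profile = c("a", "b", "c"))
# Profile "a" appears twice, mimicking duplicated profile rows
toy_profiles <- data.frame(profile = c("a", "a", "b"),
                           gender = c("Male", "Male", "Female"))

# Review 12 is dropped because anime 4 is not in toy_animes
joined <- inner_join(toy_animes, toy_reviews, by = "anime_uid")

# Review 10 now appears twice because profile "a" matched two rows
joined <- left_join(joined, toy_profiles, by = "profile")
n_after_join <- nrow(joined)   # 3

# Keep one row per review id
deduped <- distinct(joined, uid, .keep_all = TRUE)
nrow(deduped)                  # 2
```

An alternative would be to de-duplicate profiles on profile before joining, which avoids creating the extra rows in the first place.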

Tidy & Manipulate Data I

# This is the R chunk for the Tidy & Manipulate Data I 
#On inspection of the data frame, the birthdays were untidy because they were recorded in irregular formats. Since the purpose was to investigate the reviewer demographics of the top 3 most popular anime genres, year of birth is more suitable. Some observations contained invalid values, hence 387 failed to parse. These entries will be removed in the following steps.
tidy1$birthday <- year(parse_date_time(tidy1$birthday, orders = c("mdy","y", "my","ym", "dmy")));

## Warning: 387 failed to parse.

tidy1<-rename(tidy1, birthyear=birthday);
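As a small illustration of the parsing step above, the sketch below runs parse_date_time() on a few made-up birthday strings in mixed layouts and counts the failures. The strings and the reduced list of orders are assumptions for illustration, and English month names are assumed to be recognised in the current locale.

```r
library(lubridate)

# Made-up birthday strings in mixed layouts (illustration only)
birthdays <- c("Oct 2, 1994", "Sep 15, 2001", "2 Oct 1994", "unknown")

# Strings are matched against the listed orders; anything that matches
# none of them is returned as NA and reported in a warning
parsed <- parse_date_time(birthdays, orders = c("mdy", "dmy"))

# Keep only the year, as in the report
birthyear <- year(parsed)

# Number of entries that could not be parsed
n_failed <- sum(is.na(birthyear))
```

Counting the NA values after parsing gives the same figure as the warning, which makes it easy to confirm how many rows the later cleaning steps need to drop.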

Tidy & Manipulate Data II

# This is the R chunk for the Tidy & Manipulate Data II 
# Because the genre column contained multiple values in one row, it needed to be tidied up. First it was separated into 3 columns (taking the 3 most representative genres of each anime; if an anime had only 1 genre, the other two columns are recorded as NA, which will be removed)
tidy2<-tidy1%>% tidyr::separate(genre,into = c("col1","col2","col3"), sep =",")

## Warning: Expected 3 pieces. Additional pieces discarded in 52628 rows [1, 2, 3,
## 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 8447 rows [1765,
## 1766, 1767, 1768, 1769, 1770, 1771, 1772, 1773, 1774, 1775, 1776, 1777, 1778,
## 1779, 1780, 1781, 1782, 1783, 1784, ...].

tidy3<-select(tidy2, anime_uid, c(2:8))

# This step converts the 3 genre columns (col1, col2 & col3) into long format, giving r1 (discarded) and Genre1.
tidy3<-select(pivot_longer(tidy2,cols = 2:4,names_to = "r1",values_to = "Genre1"), -r1)
# Because the data contained punctuation, this step cleans this noise out of Genre1.
tidy3$Genre1<-gsub('[[:punct:] ]+',' ',tidy3$Genre1)
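# Because only the top 3 popular (most reviewed) genres are of interest, other genres are changed to "Other". Note that the index 1:4 used here keeps the four most frequent values, which is why four genres appear in the summary in Scan I.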
tidy3$Genre1[which(!tidy3$Genre1 %in% names(rev(sort(table(tidy3$Genre1)))[1:4]))] <- "Other"
# Genre1 needs to be converted to a factor.
tidy3$Genre1<-tidy3$Genre1 %>% as.factor();
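The separate, pivot, clean and recode sequence above can be sketched on a toy data frame. The three rows below are made up, the column names g1 to g3 and slot are hypothetical, and trimws() is an addition that the report's own code does not use. Only the two most frequent genres are kept here to keep the example small.

```r
library(dplyr)
library(tidyr)

# Made-up rows mimicking the list-like genre strings
toy <- data.frame(
  uid   = c("1", "2", "3"),
  genre = c("['Action', 'Comedy', 'Drama', 'Sci-Fi']",
            "['Comedy']",
            "['Action', 'Comedy']"),
  stringsAsFactors = FALSE
)

long <- toy %>%
  # Keep at most 3 genres; extra pieces are dropped, missing ones become NA
  separate(genre, into = c("g1", "g2", "g3"), sep = ",",
           extra = "drop", fill = "right") %>%
  # One row per (review, genre); NA slots are dropped here
  pivot_longer(cols = g1:g3, names_to = "slot", values_to = "genre",
               values_drop_na = TRUE) %>%
  select(-slot) %>%
  # Replace brackets, quotes and spaces, then trim the padding
  mutate(genre = trimws(gsub("[[:punct:] ]+", " ", genre)))

# Names of the 2 most frequent genres; everything else becomes "Other"
top_genres <- names(sort(table(long$genre), decreasing = TRUE))[1:2]
long$genre[!long$genre %in% top_genres] <- "Other"
long$genre <- factor(long$genre)

table(long$genre)
```

Setting extra and fill explicitly documents what happens to rows with more or fewer than three genres, and it silences the two warnings shown above.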

Scan I

# This is the R chunk for the Scan I
# This step looks for obvious errors in the dataset. As the data description states that the score should be in the range of 1 to 10, any observation outside this range should be removed. Also, year of birth obviously cannot be later than 2020, but because this investigation focuses on adult anime viewers, 2002 was set as the threshold.
summary(tidy3)

##   anime_uid           birthyear        uid               score.y      
##  Length:225219      Min.   :1930   Length:225219      Min.   : 0.000  
##  Class :character   1st Qu.:1991   Class :character   1st Qu.: 6.000  
##  Mode  :character   Median :1995   Mode  :character   Median : 8.000  
##                     Mean   :1996                      Mean   : 7.428  
##                     3rd Qu.:1999                      3rd Qu.: 9.000  
##                     Max.   :2059                      Max.   :11.000  
##                     NA's   :1161                                      
##     gender               Genre1      
##  Female: 57168    Action    : 30152  
##  Male  :168051    Adventure : 14596  
##                   Comedy    : 30402  
##                   Drama     : 16345  
##                  Other      :133724  
##                                      
##

# Need to fix birthyear and score
Rules <- editfile("Editrules.txt", type = "all")
V_t3<-violatedEdits(Rules,tidy3)
summary(V_t3)

## Edit violations, 225219 observations, 0 completely missing (0%):
## 
##  editname  freq   rel
##      num2 36405 16.2%
##      num1  1716  0.8%
##      num3     3    0%
##      num4     3    0%
## 
## Edit violations per record:
## 
##  errors   freq   rel
##       0 185931 82.6%
##       1  38127 16.9%
##       2   1161  0.5%

Rules1 <- correctionRules('Editrules.txt')
#correctWithRules() caused RStudio to freeze, hence the correction was done manually
tidy4<-tidy3[(tidy3$birthyear <= 2002) & (tidy3$birthyear >= 1960),]
tidy5<-tidy4[(tidy4$score.y <= 10) & (tidy4$score.y > 0),]
tidy6<-na.omit(tidy5)
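The manual correction above relies on na.omit() as a final step. The sketch below, on a made-up five-row data frame, shows why: logical subsetting with a condition that evaluates to NA returns an all-NA row rather than dropping it. It also shows an equivalent single filter that handles NA explicitly. The thresholds are the ones used in the report; the data are hypothetical.

```r
# Made-up values covering each case: valid, missing year, future year,
# too-early year, and an out-of-range score
toy <- data.frame(
  birthyear = c(1995, NA, 2059, 1930, 1988),
  score     = c(8, 7, 9, 5, 11)
)

# Subsetting as in the report: the NA birthyear yields an all-NA row
step1 <- toy[(toy$birthyear <= 2002) & (toy$birthyear >= 1960), ]
nrow(step1)            # 3 rows, one of them entirely NA

step2 <- step1[(step1$score <= 10) & (step1$score > 0), ]
clean_a <- na.omit(step2)

# Equivalent single filter with the NA check made explicit
keep <- !is.na(toy$birthyear) &
  toy$birthyear >= 1960 & toy$birthyear <= 2002 &
  !is.na(toy$score) &
  toy$score > 0 & toy$score <= 10
clean_b <- toy[keep, ]

nrow(clean_a)          # 1
nrow(clean_b)          # 1
```

Both routes keep the same row; the explicit version makes it clearer that missing values are being excluded on purpose.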

Scan II

# This is the R chunk for the Scan II

# Checking for outliers using boxplot(). As the figures below show, the user score figure had one likely outlier. However, given that in the real world some reviewers do give a score of 1, it was not removed as an outlier. The reviewer demographic figure, on the other hand, clearly showed many outliers.
par(mfrow=c(1,2))
tidy6$birthyear%>% boxplot(main='Reviews Demographic')
tidy6$score.y%>% boxplot(main='User score')

# Tukey's method (the boxplots above) and the z-score method were used to identify these outliers in the reviewer demographic; observations with |z| > 3 were then removed.
z.scores <-  tidy6$birthyear %>%  scores(type = "z")
z.scores %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5.9244 -0.5808  0.1317  0.0000  0.6661  1.5567

length (which( abs(z.scores) >3 ))

## [1] 3216

# subset values that have absolute value of z-scores less than 3
tidy7<-subset(tidy6,subset = abs(z.scores) <3 )
# Most outliers that may be due to input errors have been removed
par(mfrow=c(1,2))
tidy7$birthyear%>% boxplot(main='Reviews Demographic')
tidy7$score.y%>% boxplot(main='User score')
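For reference, the two outlier rules mentioned above can be written out in base R. The sketch below uses a made-up vector of birth years with one implausible value; the z-score is computed by hand as (x - mean) / sd, which is assumed to match what scores(type = "z") returns, and the Tukey fences use the usual 1.5 x IQR rule.

```r
# Made-up birth years: two copies of 1985..2002 plus one extreme value
birthyear <- c(1985:2002, 1985:2002, 1930)

# z-score rule: flag values more than 3 standard deviations from the mean
z <- (birthyear - mean(birthyear)) / sd(birthyear)
z_flag <- abs(z) > 3

# Tukey rule: flag values beyond 1.5 * IQR from the quartiles
q <- quantile(birthyear, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
tukey_flag <- birthyear < q[1] - fence | birthyear > q[2] + fence

# Keep only the observations that pass the z-score rule
kept <- birthyear[!z_flag]

sum(z_flag)      # 1
sum(tukey_flag)  # 1
```

On this toy vector both rules flag only the value 1930. On skewed data the two rules can disagree, because the mean and standard deviation are themselves pulled by the outliers.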

Transform

# This is the R chunk for the Transform Section
# Checking for normality using hist().

par(mfrow=c(1,2))
tidy7$birthyear %>% hist(main='Demographic distr.', xlab = 'Year of birth')
tidy7$score.y %>% hist(main='User score distr.', xlab = 'Score 1 to 10')

plot(tidy7$score.y,tidy7$birthyear)

# Unfortunately there is no visible relationship between user score and birthyear

par(mfrow=c(1,2))
birthyear1 <- BoxCox(tidy7$birthyear, lambda = 'auto')
birthyear1 %>% hist(main='Transformed Demographic distr.')

score.y1 <- BoxCox(tidy7$score.y, lambda = 'auto')
score.y1 %>% hist(main='Transformed User score distr.')

# As the two transformed figures below show, the Box-Cox transformation reduced the left skewness in the user score distribution, but it was less successful for the demographic distribution. For statistical investigations, a square transformation could be applied to further normalise the demographic distribution. However, due to limited computing power (the Box-Cox computation took me 3 hours), this has not been tested with this data in this assignment.
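To make the transformation itself concrete, the sketch below writes the standard one-parameter Box-Cox formula as a small base R function. The function name box_cox and the example values are hypothetical; in the report, lambda was chosen automatically by BoxCox() rather than fixed by hand.

```r
# Standard Box-Cox transform for strictly positive data:
#   (x^lambda - 1) / lambda   when lambda != 0
#   log(x)                    when lambda == 0
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (abs(lambda) < 1e-8) {
    log(x)
  } else {
    (x^lambda - 1) / lambda
  }
}

scores <- c(1, 4, 9)

box_cox(scores, 0)    # natural log of each score
box_cox(scores, 0.5)  # 2 * (sqrt(x) - 1), i.e. 0, 2, 4
box_cox(scores, 2)    # (x^2 - 1) / 2, a square-type transform
```

A lambda above 1 stretches the upper end of the scale, which is the direction needed for left-skewed data. If the automatic search is too slow on the full data, one option (untested here) would be to estimate lambda on a random subsample and then apply that fixed value to the whole column.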