Data Wrangling (Data Preprocessing)

Required packages

library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)

Executive Summary

Two data sets containing information about anime (“Top 10000 Anime Movies, OVA’s and Tv-Shows” and “Anime Database for Recommendation system”, both sourced from Kaggle.com) were imported as CSV files and then merged.

Conversions were undertaken for data type correction, as many variables were incorrectly identified as being numeric instead of integer, and character instead of factor.

The data frame was untidy in nature, due to each variable not having its own column, and each value not having its own cell. As such, this was remedied by separating cell values into their correct variables, forming a tidy data set.

Duplicate variables from the merging of datasets were removed, and the variable names were updated.

After having converted the original date variables into a recognised format, a new variable denoting a given anime’s airing period was created.

The variables of the data frame were reordered to have a more logical and sequential flow.

A variety of strings and symbols, such as “?”, “”, and “No…information”, were being used to denote missing values, and were therefore recoded as NA. Missing values were not imputed, given the nature of the data. Upon checking, no special values were present in the data.

Data types were re-checked, and an appropriate conversion was made.

Because the numeric variables were not normally distributed, Tukey’s method (using boxplots) was the non-parametric method used to identify univariate outliers. Despite the presence of numerous outliers, no replacement values were imputed, given the nature of the data.

The number of viewers of each anime was transformed via min-max normalisation. This puts viewership on a 0 to 1 scale, which is easier to interpret and highlights relative popularity differences between animes.

Data

The two data sets to be merged are as follows:

“Top 10000 Anime Movies, OVA’s and Tv-Shows” (April 2021)

(sourced from https://www.kaggle.com/thomaskonstantin/top-10000-anime-movies-ovas-and-tvshows)

Which contains the variables:

Anime_Name: Official title of the anime
Anime_Episodes: Type of anime e.g. OVA (original video animation)/Movie/TV Show, and the number of episodes
Anime_Air_Years: Start and end date of the anime’s airing period
Anime_Rating: Rating of anime (out of 10) as per myanimelist.net
Synopsis: Brief description of the anime’s premise

“Anime Database for Recommendation system” (July 2020)

(sourced from https://www.kaggle.com/vishalmane109/anime-recommendations-database).

Which contains the variables:

Anime_id: ID value of the anime on myanimelist.net
Title: Official title of the anime
Genre: Main genres
Synopsis: Brief description of the anime’s premise
Type: Type of anime e.g. ONA (original net animation)/Movie/TV Show
Producer: Producer of the anime
Studio: Animation studio that created the anime
Rating: Rating of anime (out of 10) as per myanimelist.net
ScoredBy: Total number of users that rated the anime
Popularity: Rank of the anime, based on popularity
Members: Number of myanimelist users that have seen the anime
Episodes: Number of episodes
Source: Source of the anime content e.g. Manga, Novel, Original
Aired: Start and end date of the anime’s airing period
Link: URL to the anime’s myanimelist page

anime1 <- read.csv("C:\\taretae\\uni\\Anime_Top10000.csv",skipNul = TRUE) # Importing CSV
head(anime1) # Provides first six obs

anime2 <- read.csv("C:\\taretae\\uni\\Anime_data.csv") # Importing CSV
head(anime2) # Provides first six obs

anime <- inner_join(anime1,anime2,by = c("Anime_Name"="Title")) # Merging data sets via anime title
anime <- anime %>% arrange(desc(Anime_Rating)) # Ordering data set by descending rating
head(anime) # Provides first six obs
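Because inner_join() keeps only titles that appear in both sources, and both sources carry a Synopsis column, the merged data frame drops unmatched animes and gains the suffixed columns Synopsis.x and Synopsis.y seen later. The following is a minimal sketch of that behaviour on invented toy data (toy1, toy2 and their values are hypothetical and not taken from the Kaggle files):

```r
library(dplyr)

# Hypothetical toy frames standing in for anime1 and anime2
toy1 <- data.frame(Anime_Name = c("A", "B", "C"),
                   Synopsis   = c("a1", "b1", "c1"))
toy2 <- data.frame(Title    = c("B", "C", "D"),
                   Synopsis = c("b2", "c2", "d2"))

# Only titles found in both frames survive the inner join
toy_joined <- inner_join(toy1, toy2, by = c("Anime_Name" = "Title"))
names(toy_joined) # The shared column name is suffixed with .x and .y

# Titles from the first source with no match in the second
toy_unmatched <- anti_join(toy1, toy2, by = c("Anime_Name" = "Title"))
toy_unmatched
```

Running anti_join() on the real data in the same way would show how many titles are lost to spelling differences between the two sources.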

Understand

sapply(anime, class) # Provides list of variables with their current data types

##      Anime_Name  Anime_Episodes Anime_Air_Years    Anime_Rating      Synopsis.x 
##     "character"     "character"     "character"       "numeric"     "character" 
##        Anime_id           Genre      Synopsis.y            Type        Producer 
##       "integer"     "character"     "character"     "character"     "character" 
##          Studio          Rating        ScoredBy      Popularity         Members 
##     "character"       "numeric"       "integer"       "numeric"       "numeric" 
##        Episodes          Source           Aired            Link 
##       "integer"     "character"     "character"     "character"

# Identification of the correct data types for variables with incorrect data types
int_vars <- c(14,15) # Integer variables
fac_vars <- c(7,9,10,11,17) # Factor variables

# Data type conversions
anime[,int_vars] <- lapply(anime[,int_vars],as.integer) # Converts desired vars to integer vars
anime[,fac_vars] <- lapply(anime[,fac_vars],factor) # Converts desired vars to factor vars

sapply(anime, class) # Provides list of variables with their correct data types

##      Anime_Name  Anime_Episodes Anime_Air_Years    Anime_Rating      Synopsis.x 
##     "character"     "character"     "character"       "numeric"     "character" 
##        Anime_id           Genre      Synopsis.y            Type        Producer 
##       "integer"        "factor"     "character"        "factor"        "factor" 
##          Studio          Rating        ScoredBy      Popularity         Members 
##        "factor"       "numeric"       "integer"       "integer"       "integer" 
##        Episodes          Source           Aired            Link 
##       "integer"        "factor"     "character"     "character"

# Note that a labelled factor will be created further on
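The conversions above select columns by position, which silently targets the wrong variables if the column order ever changes. As a hedged alternative, the same conversions can be written by column name. This sketch uses an invented toy frame (toy and its values are hypothetical) rather than the merged data:

```r
library(dplyr)

# Hypothetical toy frame with one count column and one categorical column
toy <- data.frame(Genre   = c("Action", "Drama", "Action"),
                  Members = c(1200, 85, 40310),
                  stringsAsFactors = FALSE)

# Convert by name instead of by column index
toy_typed <- toy %>%
  mutate(across(all_of("Members"), as.integer),
         across(all_of("Genre"), factor))

sapply(toy_typed, class) # Members is now integer, Genre is now factor
```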

Tidy & Manipulate Data I

The data obtained from the first source (anime1) is untidy in nature. In the anime data frame, columns two and three contain values for multiple variables. Column two (Anime_Episodes) holds both the type of anime (TV show/Movie/OVA), as well as the number of episodes. Column three (Anime_Air_Years) holds when the anime started airing, and when it finished. As the cells for these variables contain more than one value for different variables, tidy data principles have been violated.

# Separates Anime_Episodes into two new variables; Type and Ep_Count
# Separates Anime_Air_Years into two new variables; Airing_Start and Airing_End
# Additionally, splits the unnecessary trailing "eps)" text off Ep_Count into a throwaway variable (Delete)
anime <- anime %>%
  separate(Anime_Episodes, into=c("Type","Ep_Count"), sep="\\(") %>%
  separate(Anime_Air_Years, into=c("Airing_Start","Airing_End"), sep="-") %>%
  separate(Ep_Count, into=c("Ep_Count","Delete"), sep="e")
head(anime)

anime$Type <- str_trim(anime$Type) # Trims whitespace from character strings
anime$Type <- factor(anime$Type, levels = c("TV","Movie","OVA","ONA","Special","Music")) # Creation of labeled factor variable, regarding type of anime

# Selection of desired variables, removing duplicates formed during merge
anime <- anime %>% select(-c("Delete","Synopsis.y","Rating","Episodes","Aired"))
head(anime)

# Re-naming of variables for better consistency and understanding
colnames(anime) <- c("Title","Type","Ep_Count","Airing_Start","Airing_End","Rating","Synopsis","MAL_ID", "Genre","Producer","Studio","Num_Scorers","Popularity_Ranking","Num_Viewers","Source","Link")
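The separation of Anime_Episodes above takes two separate() calls plus a whitespace trim. As an illustrative alternative, tidyr::extract() can pull both values out in one step with a regular expression. This is only a sketch: the "Type (N eps)" layout and the toy values are assumptions inferred from the separators used above, not confirmed against the raw file.

```r
library(tidyr)

# Hypothetical values in the assumed "Type (N eps)" layout of Anime_Episodes
toy <- data.frame(Anime_Episodes = c("TV (64 eps)", "Movie (1 eps)", "OVA (? eps)"),
                  stringsAsFactors = FALSE)

# One capture group per target variable: the type, then the episode count
toy_tidy <- tidyr::extract(toy, Anime_Episodes,
                           into  = c("Type", "Ep_Count"),
                           regex = "^([A-Za-z]+) \\(([^ ]+) eps\\)$")
toy_tidy
```

With this approach no stray whitespace or throwaway column is produced, although rows that do not match the pattern would come back as NA.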

Tidy & Manipulate Data II

Creation of a variable indicating run time of a given anime, based on the start date and end date of its airing period. Additional creation of a factor variable indicating the season of release of the anime, based on the airing start date.

# Conversion of Airing_Start and Airing_End to date formats
anime$Airing_Start <- parse_date_time(anime$Airing_Start,orders="my")

## Warning: 248 failed to parse.

anime$Airing_End <- parse_date_time(anime$Airing_End,orders="my")

## Warning: 254 failed to parse.

# Creation of a new variable indicating run time (in months)
anime <- mutate(anime, "Runtime_Months" = interval(anime$Airing_Start, anime$Airing_End) %/% months(1))
anime <- anime[,c(1:5,17,15,6,12:14,7,9:11,8,16)] # Reorders data frame into a desired order
head(anime)
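To make the run time calculation concrete, the sketch below applies the same parse_date_time() and interval() steps to two invented airing periods. The month-year strings are hypothetical, and numeric months are used so that the example does not depend on the system locale.

```r
library(lubridate)

# Hypothetical month-year strings; the layout of the real columns is assumed
toy_start <- parse_date_time(c("04 2009", "10 2006"), orders = "my")
toy_end   <- parse_date_time(c("07 2010", "03 2007"), orders = "my")

# Whole months between each start and end date
toy_runtime <- interval(toy_start, toy_end) %/% months(1)
toy_runtime
```

Integer division by months(1) counts only completed months, so an anime that starts and ends in the same month has a run time of 0.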

Scan I

Upon inspection of the data, missing values are denoted in the data set by a variety of strings and symbols. They will all be converted to NAs.

The following indicates the incorrect denotation of missing values in variables:

Ep_Count: “?”
Source: ""
Num_Viewers: “0” to be replaced with NA, as for animes to be on this list they must have been viewed at least once (inconsistency in the data)
Synopsis: “No synopsis information has been added to this title. Help improve our database by adding a synopsis here.”
Genre: ""
Producer: ""
Studio: ""
Link: ""

Missing values will not be imputed with any other values, as the data would be detrimentally impacted by the imputation of values or strings (due to inaccurate information being shown for a given anime).

Upon additional scanning, there do not appear to be any special values or further inconsistencies. Variable data types are to be checked again and corrected where necessary.

# Duplicating the data frame to further wrangle
anime_scan <- anime

# Recoding missing values as NA
anime_scan$Ep_Count[anime_scan$Ep_Count == "? "] <- NA # Fills NAs for Ep_Count

anime_scan$Source[anime_scan$Source == ""] <- NA # Fills NAs for Source

anime_scan$Num_Viewers[anime_scan$Num_Viewers == "0"] <- NA # Fills NAs for Num_Viewers

anime_scan$Synopsis[anime_scan$Synopsis == "No synopsis information has been added to this title. Help improve our database by adding a synopsis here."] <- NA # Fills NAs for Synopsis

anime_scan$Genre[anime_scan$Genre == ""] <- NA # Fills NAs for Genre

anime_scan$Producer[anime_scan$Producer == ""] <- NA # Fills NAs for Producer

anime_scan$Studio[anime_scan$Studio == ""] <- NA # Fills NAs for Studio

anime_scan$Link[anime_scan$Link == ""] <- NA # Fills NAs for Link

colSums(is.na(anime_scan))

##              Title               Type           Ep_Count       Airing_Start 
##                  0                  0                 36                253 
##         Airing_End     Runtime_Months             Source             Rating 
##                305                310               1159                  0 
##        Num_Scorers Popularity_Ranking        Num_Viewers           Synopsis 
##               1832                106                411                102 
##              Genre           Producer             Studio             MAL_ID 
##               1180               4616               3292                  0 
##               Link 
##                106
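The recoding above repeats the same assignment for each variable. A more compact sketch, assuming character columns and using invented toy values (toy is hypothetical), applies dplyr's na_if() to every character column at once:

```r
library(dplyr)

# Hypothetical toy frame using the same missing-value markers as above
toy <- data.frame(Ep_Count = c("12 ", "? ", "24 "),
                  Studio   = c("Bones", "", "Madhouse"),
                  stringsAsFactors = FALSE)

toy_na <- toy %>%
  mutate(Ep_Count = na_if(Ep_Count, "? "),               # "? " marks an unknown episode count
         across(where(is.character), ~ na_if(.x, ""))) # Empty strings become NA

colSums(is.na(toy_na))
```

This sketch deliberately avoids factor columns, since na_if() can be stricter about types there. Converting to character first, or recoding before the factor conversion, sidesteps that.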

sapply(anime_scan, function(x) sum(is.infinite(x), is.nan(x))) # Indicates that there are no special values

##              Title               Type           Ep_Count       Airing_Start 
##                  0                  0                  0                  0 
##         Airing_End     Runtime_Months             Source             Rating 
##                  0                  0                  0                  0 
##        Num_Scorers Popularity_Ranking        Num_Viewers           Synopsis 
##                  0                  0                  0                  0 
##              Genre           Producer             Studio             MAL_ID 
##                  0                  0                  0                  0 
##               Link 
##                  0

sapply(anime_scan, class) # Provides list of variables with their current data types

## $Title
## [1] "character"
## 
## $Type
## [1] "factor"
## 
## $Ep_Count
## [1] "character"
## 
## $Airing_Start
## [1] "POSIXct" "POSIXt" 
## 
## $Airing_End
## [1] "POSIXct" "POSIXt" 
## 
## $Runtime_Months
## [1] "numeric"
## 
## $Source
## [1] "factor"
## 
## $Rating
## [1] "numeric"
## 
## $Num_Scorers
## [1] "integer"
## 
## $Popularity_Ranking
## [1] "integer"
## 
## $Num_Viewers
## [1] "integer"
## 
## $Synopsis
## [1] "character"
## 
## $Genre
## [1] "factor"
## 
## $Producer
## [1] "factor"
## 
## $Studio
## [1] "factor"
## 
## $MAL_ID
## [1] "integer"
## 
## $Link
## [1] "character"

# Data type conversion
anime_scan[,c(3,6)] <- lapply(anime_scan[,c(3,6)],as.integer) # Converts desired vars to integer vars

sapply(anime_scan, class) # Provides list of variables with their correct data types

## $Title
## [1] "character"
## 
## $Type
## [1] "factor"
## 
## $Ep_Count
## [1] "integer"
## 
## $Airing_Start
## [1] "POSIXct" "POSIXt" 
## 
## $Airing_End
## [1] "POSIXct" "POSIXt" 
## 
## $Runtime_Months
## [1] "integer"
## 
## $Source
## [1] "factor"
## 
## $Rating
## [1] "numeric"
## 
## $Num_Scorers
## [1] "integer"
## 
## $Popularity_Ranking
## [1] "integer"
## 
## $Num_Viewers
## [1] "integer"
## 
## $Synopsis
## [1] "character"
## 
## $Genre
## [1] "factor"
## 
## $Producer
## [1] "factor"
## 
## $Studio
## [1] "factor"
## 
## $MAL_ID
## [1] "integer"
## 
## $Link
## [1] "character"

Scan II

# Histograms to determine if distributions are normal
par(mfrow=c(2,2))
hist(anime_scan$Ep_Count, main="Histogram of Episode Count")
hist(anime_scan$Runtime_Months, main="Histogram of Run Time (In Months)")
hist(anime_scan$Num_Scorers, main="Histogram of the Number of Scorers")
hist(anime_scan$Num_Viewers, main="Histogram of the Number of Viewers")

# As distributions are not normal, a non-parametric outlier test must be used
par(mfrow=c(2,2))
boxplot(anime_scan$Ep_Count, main="Boxplot of Episode Count")
boxplot(anime_scan$Runtime_Months, main="Boxplot of Run Time (In Months)")
boxplot(anime_scan$Num_Scorers, main="Boxplot of the Number of Scorers")
boxplot(anime_scan$Num_Viewers, main="Boxplot of the Number of Viewers")

Upon inspection of the boxplots, each numeric variable has a large number of outliers. However, due to the nature of the data, these are not the result of measurement errors, but are instead genuine data points that reflect the unusually high popularity, viewership, airing time, and episode count of certain animes.
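The boxplots flag outliers visually. To count them, Tukey's fences can be computed directly: a value is flagged when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. The helper tukey_outliers() and the toy episode counts below are hypothetical, written only to illustrate the rule:

```r
# Hypothetical helper: TRUE for values outside Tukey's fences
tukey_outliers <- function(x, k = 1.5) {
  q   <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x < (q[1] - k * iqr) | x > (q[2] + k * iqr)
}

# Invented episode counts with one very long-running show
toy_eps <- c(1, 12, 12, 13, 24, 25, 26, 500)
toy_flags <- tukey_outliers(toy_eps)

toy_eps[toy_flags] # Values that a boxplot would draw as outlier points
sum(toy_flags)     # Number of flagged values
```

Note that boxplot() computes its quartiles from hinges rather than quantile(), so counts can differ slightly on small samples.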

Transform

Min-max normalisation is applied to the number of viewers (Num_Viewers) of each anime. This rescales viewership to the range 0 to 1, where 0 corresponds to the least-viewed anime in the data set and 1 to the most-viewed, making it easier to compare the popularity of animes relative to each other.

# Duplicating the data frame to further wrangle
anime_trans <- anime_scan

# Min Max Normalisation
minmaxnormalise <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Add transformed data as an additional variable
anime_trans <- mutate(anime_trans, "Normalised_Viewership" = minmaxnormalise(anime_trans$Num_Viewers))

anime_trans$Normalised_Viewership <- round(anime_trans$Normalised_Viewership, digits = 5) # Rounding the normalised viewership to 5 decimal places for easier comprehension
anime_trans <- anime_trans[,c(1:11,18,12:17)] # Reorders data frame into a desired order
head(anime_trans)
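As a small worked example of the transformation, the sketch below applies the same formula, (x - min) / (max - min), to four invented viewer counts. The function name minmax and the toy values are hypothetical. Missing values stay missing, the smallest count maps to 0 and the largest to 1.

```r
# Same formula as minmaxnormalise above, written with range() for brevity
minmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Invented viewer counts, including one missing value
toy_viewers <- c(50, 200, NA, 1050)
toy_scaled <- minmax(toy_viewers)
toy_scaled
```

Here 200 viewers becomes (200 - 50) / (1050 - 50) = 0.15, showing that the scaled value expresses each anime's position between the least-viewed and most-viewed titles.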

Practical assessment 2

Jasmine Farrugia s3845303
