library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
Two data sets containing information about anime (“Top 10000 Anime Movies, OVA’s and Tv-Shows” and “Anime Database for Recommendation system”, both sourced from Kaggle.com) were imported as CSV files and merged.
Data types were corrected, as several variables had been read in as numeric instead of integer, or as character instead of factor.
The data frame was untidy, as some variables did not have their own column and some cells held more than one value. This was remedied by separating cell values into their correct variables, forming a tidy data set.
Duplicate variables from the merging of datasets were removed, and the variable names were updated.
After the original date variables were converted into a recognised date format, a new variable denoting the length of each anime’s airing period was created.
The variables of the data frame were reordered to have a more logical and sequential flow.
A variety of strings and symbols, such as “?”, “”, and “No…information”, were used to denote missing values; these were recoded as NA. Missing values were not imputed given the nature of the data. Upon checking, no special values were present in the data.
Data types were re-checked, and an appropriate conversion was made.
Because the numeric variables were not normally distributed, Tukey’s method (using boxplots) was the non-parametric method used to identify univariate outliers. Despite the presence of numerous outliers, their values were not imputed given the nature of the data.
The number of viewers of each anime was transformed via min-max normalisation. This made the scale of viewership easier to interpret, highlighting relative differences in popularity between anime.
The two data sets to be merged are as follows:
“Top 10000 Anime Movies, OVA’s and Tv-Shows” (sourced from https://www.kaggle.com/thomaskonstantin/top-10000-anime-movies-ovas-and-tvshows)
“Anime Database for Recommendation system” (sourced from https://www.kaggle.com/vishalmane109/anime-recommendations-database).
anime1 <- read.csv("C:\\taretae\\uni\\Anime_Top10000.csv",skipNul = TRUE) # Importing CSV
head(anime1) # Provides first six obs
anime2 <- read.csv("C:\\taretae\\uni\\Anime_data.csv") # Importing CSV
head(anime2) # Provides first six obs
anime <- inner_join(anime1,anime2,by = c("Anime_Name"="Title")) # Merging data sets via anime title
anime <- anime %>% arrange(desc(Anime_Rating)) # Ordering data set by descending rating
head(anime) # Provides first six obs
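An inner join keeps only titles that appear in both data sets, so any title spelled differently in the two sources is silently dropped. The following is a minimal sketch, on invented toy data rather than the Kaggle files, of how the number of unmatched rows on each side could be checked with `anti_join`; the data frames `left` and `right` are hypothetical stand-ins for anime1 and anime2.

```r
library(dplyr)

# Toy stand-ins for anime1 and anime2 (titles and values are invented)
left  <- data.frame(Anime_Name = c("A", "B", "C"), Anime_Rating = c(9.1, 8.7, 8.2))
right <- data.frame(Title = c("A", "C", "D"), Members = c(100L, 50L, 10L))

# Same join key as in the report: Anime_Name on the left, Title on the right
joined <- inner_join(left, right, by = c("Anime_Name" = "Title"))

# Rows of each table that found no partner in the other
unmatched_left  <- anti_join(left, right, by = c("Anime_Name" = "Title"))
unmatched_right <- anti_join(right, left, by = c("Title" = "Anime_Name"))

nrow(joined)           # rows kept by the inner join
nrow(unmatched_left)   # rows of `left` dropped by the join
nrow(unmatched_right)  # rows of `right` dropped by the join
```

On this toy input, two rows are kept and one row from each side is dropped.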
sapply(anime, class) # Provides list of variables with their current data types
## Anime_Name Anime_Episodes Anime_Air_Years Anime_Rating Synopsis.x
## "character" "character" "character" "numeric" "character"
## Anime_id Genre Synopsis.y Type Producer
## "integer" "character" "character" "character" "character"
## Studio Rating ScoredBy Popularity Members
## "character" "numeric" "integer" "numeric" "numeric"
## Episodes Source Aired Link
## "integer" "character" "character" "character"
# Identification of the correct data types for variables with incorrect data types
int_vars <- c(14,15) # Integer variables
fac_vars <- c(7,9,10,11,17) # Factor variables
# Data type conversions
anime[,int_vars] <- lapply(anime[,int_vars],as.integer) # Converts desired vars to integer vars
anime[,fac_vars] <- lapply(anime[,fac_vars],factor) # Converts desired vars to factor vars
sapply(anime, class) # Provides list of variables with their correct data types
## Anime_Name Anime_Episodes Anime_Air_Years Anime_Rating Synopsis.x
## "character" "character" "character" "numeric" "character"
## Anime_id Genre Synopsis.y Type Producer
## "integer" "factor" "character" "factor" "factor"
## Studio Rating ScoredBy Popularity Members
## "factor" "numeric" "integer" "integer" "integer"
## Episodes Source Aired Link
## "integer" "factor" "character" "character"
# Note that a labelled factor will be created further on
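The conversions above select columns by position, which breaks if the column order ever changes. As a hedged alternative, the same idea can be sketched with column names instead; the toy data frame below is invented and only borrows three column names from the merged data.

```r
# Toy data frame; column names mirror the merged data but the values are invented
df <- data.frame(
  Genre      = c("Action", "Drama", "Action"),
  Popularity = c(1, 2, 3),
  Members    = c(1500, 320, 87),
  stringsAsFactors = FALSE
)

# Columns are identified by name rather than by position
int_vars <- c("Popularity", "Members")
fac_vars <- c("Genre")

df[int_vars] <- lapply(df[int_vars], as.integer)  # numeric -> integer
df[fac_vars] <- lapply(df[fac_vars], factor)      # character -> factor

sapply(df, class)
```

Selecting by name also makes the intent of each conversion readable without counting columns.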
The data obtained from the first source (anime1) is untidy in nature. In the anime data frame, columns two and three contain values for multiple variables. Column two (Anime_Episodes) holds both the type of anime (TV show/Movie/OVA), as well as the number of episodes. Column three (Anime_Air_Years) holds when the anime started airing, and when it finished. As the cells for these variables contain more than one value for different variables, tidy data principles have been violated.
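To make the problem concrete, here is a small sketch on invented values that are assumed to follow the same layout as Anime_Episodes and Anime_Air_Years (the exact strings in the Kaggle file may differ). It shows what each piece looks like straight after splitting, including the stray whitespace and leftover text that still need cleaning.

```r
library(tidyr)

# Invented values in the assumed layout of the two untidy columns
toy <- data.frame(
  Anime_Episodes  = c("TV (64 eps)", "Movie (1 eps)"),
  Anime_Air_Years = c("Apr 2009 - Jul 2010", "Jul 2001 - Jul 2001"),
  stringsAsFactors = FALSE
)

# Split each column at its delimiter
step1 <- separate(toy, Anime_Episodes, into = c("Type", "Ep_Count"), sep = "\\(")
tidy  <- separate(step1, Anime_Air_Years, into = c("Airing_Start", "Airing_End"), sep = "-")

# The raw pieces still carry whitespace and the "eps)" suffix
type_clean <- trimws(tidy$Type)
ep_clean   <- as.integer(gsub("[^0-9]", "", tidy$Ep_Count))  # keep digits only

type_clean
ep_clean
```

Stripping non-digits with `gsub` is one alternative to the second `separate` on "e" used in this report; it is shown here only as an illustration.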
# Separates Anime_Episodes into two new variables: Type and Ep_Count.
# Separates Anime_Air_Years into two new variables: Airing_Start and Airing_End.
# Additionally, separates the unnecessary string "eps" from Ep_Count into a throwaway column (Delete).
anime <- anime %>%
  separate(Anime_Episodes, into = c("Type", "Ep_Count"), sep = "\\(") %>%
  separate(Anime_Air_Years, into = c("Airing_Start", "Airing_End"), sep = "-") %>%
  separate(Ep_Count, into = c("Ep_Count", "Delete"), sep = "e")
head(anime)
anime$Type <- str_trim(anime$Type) # Trims whitespace from character strings
anime$Type <- factor(anime$Type, levels = c("TV","Movie","OVA","ONA","Special","Music")) # Creation of labeled factor variable, regarding type of anime
# Selection of desired variables, removing duplicates formed during merge
anime <- anime %>% select(-c("Delete","Synopsis.y","Rating","Episodes","Aired"))
head(anime)
# Re-naming of variables for better consistency and understanding
colnames(anime) <- c("Title","Type","Ep_Count","Airing_Start","Airing_End","Rating","Synopsis","MAL_ID", "Genre","Producer","Studio","Num_Scorers","Popularity_Ranking","Num_Viewers","Source","Link")
A variable indicating the run time of each anime, in whole months, is created from the start and end dates of its airing period.
# Conversion of Airing_Start and Airing_End to date formats
anime$Airing_Start <- parse_date_time(anime$Airing_Start,orders="my")
## Warning: 248 failed to parse.
anime$Airing_End <- parse_date_time(anime$Airing_End,orders="my")
## Warning: 254 failed to parse.
# Creation of a new variable indicating run time (in months)
anime <- mutate(anime, Runtime_Months = interval(Airing_Start, Airing_End) %/% months(1))
anime <- anime[,c(1:5,17,15,6,12:14,7,9:11,8,16)] # Reorders data frame into a desired order
head(anime)
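The run-time calculation divides the interval between two dates by a one-month period and keeps the whole number of months. A minimal sketch on two invented date pairs (not taken from the data) shows the behaviour, including an anime that starts and ends in the same month.

```r
library(lubridate)

# Invented airing dates, already parsed to the first day of each month
airing_start <- ymd(c("2009-04-01", "2001-07-01"))
airing_end   <- ymd(c("2010-07-01", "2001-07-01"))

# Whole months elapsed between start and end
runtime_months <- interval(airing_start, airing_end) %/% months(1)
runtime_months
```

Here the first pair spans 15 whole months and the second spans 0, so a run time of 0 means the anime started and finished within the same month rather than indicating missing data.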
Upon inspection of the data, missing values are denoted in the data set by a variety of strings and symbols. They will all be converted to NAs.
Missing values are incorrectly denoted as follows: “?” in Ep_Count; an empty string in Source, Genre, Producer, Studio and Link; 0 in Num_Viewers; and the placeholder text “No synopsis information has been added to this title. Help improve our database by adding a synopsis here.” in Synopsis.
Missing values will not be imputed, as imputed values or strings would show inaccurate information for a given anime.
Upon additional scanning, there do not appear to be any special values or any further inconsistencies. Variable data types are to be checked again and remedied.
# Duplicating the data frame to further wrangle
anime_scan <- anime
# Recoding missing values as NA
anime_scan$Ep_Count[anime_scan$Ep_Count == "? "] <- NA # Fills NAs for Ep_Count
anime_scan$Source[anime_scan$Source == ""] <- NA # Fills NAs for Source
anime_scan$Num_Viewers[anime_scan$Num_Viewers == 0] <- NA # Fills NAs for Num_Viewers (integer column, so compared with the number 0)
anime_scan$Synopsis[anime_scan$Synopsis == "No synopsis information has been added to this title. Help improve our database by adding a synopsis here."] <- NA # Fills NAs for Synopsis
anime_scan$Genre[anime_scan$Genre == ""] <- NA # Fills NAs for Genre
anime_scan$Producer[anime_scan$Producer == ""] <- NA # Fills NAs for Producer
anime_scan$Studio[anime_scan$Studio == ""] <- NA # Fills NAs for Studio
anime_scan$Link[anime_scan$Link == ""] <- NA # Fills NAs for Link
colSums(is.na(anime_scan))
## Title Type Ep_Count Airing_Start
## 0 0 36 253
## Airing_End Runtime_Months Source Rating
## 305 310 1159 0
## Num_Scorers Popularity_Ranking Num_Viewers Synopsis
## 1832 106 411 102
## Genre Producer Studio MAL_ID
## 1180 4616 3292 0
## Link
## 106
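The recoding above repeats the same pattern once per column. As a sketch only, the pattern could be wrapped in a small helper and applied to several columns at once; `recode_missing` is a hypothetical name introduced here, and the toy values are invented.

```r
# Toy data frame with invented values and the kinds of placeholders seen above
toy <- data.frame(
  Ep_Count = c("12 ", "? ", "1 "),
  Source   = c("Manga", "", "Original"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace any value found in `markers` with NA
recode_missing <- function(x, markers) {
  x[x %in% markers] <- NA
  x
}

# Apply the helper to every column, keeping the data frame structure
toy[] <- lapply(toy, recode_missing, markers = c("? ", ""))
colSums(is.na(toy))
```

Using `%in%` rather than `==` avoids NA values in the logical index when a column already contains NA. For factor columns, the unused placeholder level remains after recoding and can be dropped with `droplevels()` if desired.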
sapply(anime_scan, function(x) sum(is.infinite(x), is.nan(x))) # Indicates that there are no special values
## Title Type Ep_Count Airing_Start
## 0 0 0 0
## Airing_End Runtime_Months Source Rating
## 0 0 0 0
## Num_Scorers Popularity_Ranking Num_Viewers Synopsis
## 0 0 0 0
## Genre Producer Studio MAL_ID
## 0 0 0 0
## Link
## 0
sapply(anime_scan, class) # Provides list of variables with their current data types
## $Title
## [1] "character"
##
## $Type
## [1] "factor"
##
## $Ep_Count
## [1] "character"
##
## $Airing_Start
## [1] "POSIXct" "POSIXt"
##
## $Airing_End
## [1] "POSIXct" "POSIXt"
##
## $Runtime_Months
## [1] "numeric"
##
## $Source
## [1] "factor"
##
## $Rating
## [1] "numeric"
##
## $Num_Scorers
## [1] "integer"
##
## $Popularity_Ranking
## [1] "integer"
##
## $Num_Viewers
## [1] "integer"
##
## $Synopsis
## [1] "character"
##
## $Genre
## [1] "factor"
##
## $Producer
## [1] "factor"
##
## $Studio
## [1] "factor"
##
## $MAL_ID
## [1] "integer"
##
## $Link
## [1] "character"
# Data type conversion
anime_scan[,c(3,6)] <- lapply(anime_scan[,c(3,6)],as.integer) # Converts desired vars to integer vars
sapply(anime_scan, class) # Provides list of variables with their correct data types
## $Title
## [1] "character"
##
## $Type
## [1] "factor"
##
## $Ep_Count
## [1] "integer"
##
## $Airing_Start
## [1] "POSIXct" "POSIXt"
##
## $Airing_End
## [1] "POSIXct" "POSIXt"
##
## $Runtime_Months
## [1] "integer"
##
## $Source
## [1] "factor"
##
## $Rating
## [1] "numeric"
##
## $Num_Scorers
## [1] "integer"
##
## $Popularity_Ranking
## [1] "integer"
##
## $Num_Viewers
## [1] "integer"
##
## $Synopsis
## [1] "character"
##
## $Genre
## [1] "factor"
##
## $Producer
## [1] "factor"
##
## $Studio
## [1] "factor"
##
## $MAL_ID
## [1] "integer"
##
## $Link
## [1] "character"
# Histograms to determine if distributions are normal
par(mfrow=c(2,2))
hist(anime_scan$Ep_Count, main="Histogram of Episode Count")
hist(anime_scan$Runtime_Months, main="Histogram of Run Time (In Months)")
hist(anime_scan$Num_Scorers, main="Histogram of the Number of Scorers")
hist(anime_scan$Num_Viewers, main="Histogram of the Number of Viewers")
# As distributions are not normal, a non-parametric outlier test must be used
par(mfrow=c(2,2))
boxplot(anime_scan$Ep_Count, main="Boxplot of Episode Count")
boxplot(anime_scan$Runtime_Months, main="Boxplot of Run Time (In Months)")
boxplot(anime_scan$Num_Scorers, main="Boxplot of the Number of Scorers")
boxplot(anime_scan$Num_Viewers, main="Boxplot of the Number of Viewers")
Upon inspection of the boxplots, each numeric variable has a large number of outliers. However, given the nature of the data, these are not measurement errors but genuine data points that reflect the unusually high popularity, viewership, airing time, and episode count of certain anime. They are therefore retained as they are.
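Tukey's method flags a value as an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. The sketch below counts such values numerically instead of reading them off a plot; `tukey_outliers` is a hypothetical helper and the episode counts are invented.

```r
# Hypothetical helper: TRUE for values outside Tukey's fences, FALSE otherwise
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])  # k times the interquartile range
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

x <- c(1, 12, 12, 13, 24, 25, 26, 500, NA)  # invented episode counts

sum(tukey_outliers(x))  # number of flagged values
x[tukey_outliers(x)]    # the flagged values themselves
```

On this toy vector only the value 500 is flagged. Note that `boxplot()` computes its whiskers from hinges rather than from `quantile()`, so counts can differ slightly on real data; `boxplot.stats(x)$out` returns exactly the points a boxplot would draw.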
Min-max normalisation is applied to the number of viewers (Num_Viewers) of each anime. This rescales viewership to the range 0 to 1, with 0 indicating the least-viewed anime and 1 the most-viewed, which makes it easier to compare the popularity of anime relative to each other.
# Duplicating the data frame to further wrangle
anime_trans <- anime_scan
# Min Max Normalisation
minmaxnormalise <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
# Add transformed data as an additional variable
anime_trans <- mutate(anime_trans, Normalised_Viewership = minmaxnormalise(Num_Viewers))
# Rounding the normalised viewership to 5 decimal places for easier comprehension
anime_trans$Normalised_Viewership <- round(anime_trans$Normalised_Viewership, digits = 5)
anime_trans <- anime_trans[,c(1:11,18,12:17)] # Reorders data frame into a desired order
head(anime_trans)
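As a quick sanity check of the transformation, the same formula can be run on a handful of invented viewer counts. This is a sketch only; the vector `viewers` is made up and the function is redefined so that the example is self-contained.

```r
# Same formula as above, repeated so the example runs on its own
minmaxnormalise <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

viewers <- c(200, 1200, NA, 5200)  # invented viewer counts
scaled <- minmaxnormalise(viewers)

scaled                        # smallest count maps to 0, largest to 1
range(scaled, na.rm = TRUE)   # should span exactly 0 to 1
```

Missing values stay missing, and every other value lands between 0 and 1 in proportion to its distance from the minimum. If all non-missing values were identical, the denominator would be zero and the result would be NaN, which is worth checking before applying the function to other variables.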